{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qQG4iSxVxw8f"
      },
      "source": [
        "# RAGAS Evaluation for LangChain Agents"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "1eOnr6z_zLoc",
        "outputId": "b43767c1-4b53-498c-c21e-922b7d3c4696"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Python 3.10.12\n"
          ]
        }
      ],
      "source": [
        "!python --version"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ov6TCS7bx1oI"
      },
      "source": [
        "**R**etrieval **A**ugmented **G**eneration **As**sessment (RAGAS) is an evaluation framework for quantifying the performances of our RAG pipelines. In this example we will see how to use it with a RAG-enabled conversational agent in LangChain.\n",
        "\n",
        "Because we need an agent and RAG pipeline to evaluate RAGAS the first part of this notebook covers setting up an XML Agent with RAG. Jump ahead to **Integrating RAGAS** for the RAGAS section.\n",
        "\n",
        "To begin, let's install the prerequisites:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "zshhLDrgbFKk",
        "outputId": "48f7b79b-9fe6-459a-a207-3234d4a52a5f"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m802.4/802.4 kB\u001b[0m \u001b[31m8.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m1.6/1.6 MB\u001b[0m \u001b[31m35.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m846.7/846.7 kB\u001b[0m \u001b[31m36.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m52.1/52.1 kB\u001b[0m \u001b[31m4.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m201.4/201.4 kB\u001b[0m \u001b[31m9.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m507.1/507.1 kB\u001b[0m \u001b[31m25.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m65.4/65.4 kB\u001b[0m \u001b[31m4.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m242.1/242.1 kB\u001b[0m \u001b[31m16.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m56.5/56.5 kB\u001b[0m \u001b[31m3.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m75.9/75.9 kB\u001b[0m \u001b[31m6.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m3.1/3.1 MB\u001b[0m \u001b[31m53.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m115.3/115.3 kB\u001b[0m \u001b[31m13.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m14.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m1.8/1.8 MB\u001b[0m \u001b[31m80.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m226.7/226.7 kB\u001b[0m \u001b[31m20.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m71.1/71.1 kB\u001b[0m \u001b[31m7.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m49.4/49.4 kB\u001b[0m \u001b[31m5.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m77.0/77.0 kB\u001b[0m \u001b[31m8.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m58.3/58.3 kB\u001b[0m \u001b[31m6.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m241.3/241.3 kB\u001b[0m \u001b[31m21.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m241.2/241.2 kB\u001b[0m \u001b[31m22.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m55.4/55.4 kB\u001b[0m \u001b[31m6.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m14.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h"
          ]
        }
      ],
      "source": [
        "!pip install -qU \\\n",
        "    langchain==0.1.1 \\\n",
        "    langchain-community==0.0.13 \\\n",
        "    langchainhub==0.1.14 \\\n",
        "    anthropic==0.14.0 \\\n",
        "    cohere==4.45 \\\n",
        "    pinecone-client==3.1.0 \\\n",
        "    datasets==2.16.1 \\\n",
        "    ragas==0.1.0"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "VeosBMaIDs52"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "from getpass import getpass\n",
        "\n",
        "# dashboard.cohere.com\n",
        "os.environ[\"COHERE_API_KEY\"] = \"<<YOUR_KEY>>\" or getpass(\"Cohere API key: \")\n",
        "# app.pinecone.io\n",
        "os.environ[\"PINECONE_API_KEY\"] = \"<<YOUR_KEY>>\" or getpass(\"Pinecone API key: \")\n",
        "# console.anthropic.com\n",
        "os.environ[\"ANTHROPIC_API_KEY\"] = \"<<YOUR_KEY>>\" or getpass(\"Anthropic API key: \")\n",
        "# platform.openai.com\n",
        "os.environ[\"OPENAI_API_KEY\"] = \"<<YOUR_KEY>>\" or getpass(\"OpenAI API key: \")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bpKfZkUYzQhB"
      },
      "source": [
        "## Finding Knowledge"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JDTQoxcNzUa8"
      },
      "source": [
        "The first thing we need for an agent using RAG is somewhere we want to pull knowledge from. We will use v2 of the AI ArXiv dataset, available on Hugging Face Datasets at [`jamescalam/ai-arxiv2-chunks`](https://huggingface.co/datasets/jamescalam/ai-arxiv2-chunks).\n",
        "\n",
        "_Note: we're using the prechunked dataset. For the raw version see [`jamescalam/ai-arxiv2`](https://huggingface.co/datasets/jamescalam/ai-arxiv2)._"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 297,
          "referenced_widgets": [
            "4b77ad5351bb4025be38d174ee9288a2",
            "e0be261c59d24b6188440a59391c16b8",
            "d450832f530c430cbb8040a1febb8d67",
            "b4ddf1bfdfe94e879c68b9ecfeae423e",
            "1963e604f6f84d1394ef0f5329bfed83",
            "83dfc2c36bd54ec6867d76a4f23e6726",
            "9dcbadd1b0864b99bb8ddc7631d99de0",
            "a06cec752c844242812abaec6aab0ad0",
            "6e374f4315674d6aaa0ff137aa50cac5",
            "9d5b412671284b85ad87d0fc93cc517b",
            "0e1562bc68a2493f8714fe490b21f0a0",
            "fc58156677c04c30a8d0c468c2a9421b",
            "2c6bcde588684c74b31e0ba64652bf10",
            "f923f5f768d548f88baa37bade011d7b",
            "e952debe03384be983daaff7306a2a1e",
            "d8e21208bded4906967180f28557b046",
            "aab214cee4414cfb95a525ea11635d40",
            "aaa3db9ab46a439a82ab82e308820df4",
            "c467e357ea5742efaf9d07c67b33bab6",
            "e503d6bc4517471fa75e4cd62d9081ac",
            "8482c2ffa17444f88d6c46163df3499a",
            "762eea88a5824b3ebe599eb82c48788a"
          ]
        },
        "id": "U9gpYFnzbFKm",
        "outputId": "4d65517e-e5b9-4182-b0cb-9a5cb60f2438"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning: \n",
            "The secret `HF_TOKEN` does not exist in your Colab secrets.\n",
            "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n",
            "You will be able to reuse this secret in all of your notebooks.\n",
            "Please note that authentication is recommended but still optional to access public models or datasets.\n",
            "  warnings.warn(\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "4b77ad5351bb4025be38d174ee9288a2",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading data:   0%|          | 0.00/766M [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "fc58156677c04c30a8d0c468c2a9421b",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Generating train split: 0 examples [00:00, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/plain": [
              "Dataset({\n",
              "    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],\n",
              "    num_rows: 20000\n",
              "})"
            ]
          },
          "execution_count": 4,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from datasets import load_dataset\n",
        "\n",
        "dataset = load_dataset(\"jamescalam/ai-arxiv2-chunks\", split=\"train[:20000]\")\n",
        "dataset"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "_bP7ZW-ybFKm",
        "outputId": "f60b1622-3462-4d0d-9845-c333667eab2f"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'doi': '2401.09350',\n",
              " 'chunk-id': 1,\n",
              " 'chunk': 'These neural networks and their training algorithms may be complex, and the scope of their impact broad and wide, but nonetheless they are simply functions in a high-dimensional space. A trained neural network takes a vector as input, crunches and transforms it in various ways, and produces another vector, often in some other space. An image may thereby be turned into a vector, a song into a sequence of vectors, and a social network as a structured collection of vectors. It seems as though much of human knowledge, or at least what is expressed as text, audio, image, and video, has a vector representation in one form or another.\\nIt should be noted that representing data as vectors is not unique to neural networks and deep learning. In fact, long before learnt vector representations of pieces of data\u00e2\\x80\\x94what is commonly known as \u00e2\\x80\\x9cembeddings\u00e2\\x80\\x9d\u00e2\\x80\\x94came along, data was often encoded as hand-crafted feature vectors. Each feature quanti- fied into continuous or discrete values some facet of the data that was deemed relevant to a particular task (such as classification or regression). Vectors of that form, too, reflect our understanding of a real-world object or concept.',\n",
              " 'id': '2401.09350#1',\n",
              " 'title': 'Foundations of Vector Retrieval',\n",
              " 'summary': 'Vectors are universal mathematical objects that can represent text, images,\\nspeech, or a mix of these data modalities. That happens regardless of whether\\ndata is represented by hand-crafted features or learnt embeddings. Collect a\\nlarge enough quantity of such vectors and the question of retrieval becomes\\nurgently relevant: Finding vectors that are more similar to a query vector.\\nThis monograph is concerned with the question above and covers fundamental\\nconcepts along with advanced data structures and algorithms for vector\\nretrieval. In doing so, it recaps this fascinating topic and lowers barriers of\\nentry into this rich area of research.',\n",
              " 'source': 'http://arxiv.org/pdf/2401.09350',\n",
              " 'authors': 'Sebastian Bruch',\n",
              " 'categories': 'cs.DS, cs.IR',\n",
              " 'comment': None,\n",
              " 'journal_ref': None,\n",
              " 'primary_category': 'cs.DS',\n",
              " 'published': '20240117',\n",
              " 'updated': '20240117',\n",
              " 'references': []}"
            ]
          },
          "execution_count": 5,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "dataset[1]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VX6NdQhgbFKn"
      },
      "source": [
        "## Building the Knowledge Base"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MDCbqQl_bFKn"
      },
      "source": [
        "To build our knowledge base we need _two things_:\n",
        "\n",
        "1. Embeddings, for this we will use `CohereEmbeddings` using Cohere's embedding models, which do need an [API key](https://dashboard.cohere.com/api-keys).\n",
        "2. A vector database, where we store our embeddings and query them. We use Pinecone which again requires a [free API key](https://app.pinecone.io).\n",
        "\n",
        "First we initialize our connection to Cohere and define an `embed` helper function:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "id": "wkw0KyLRbFKo"
      },
      "outputs": [],
      "source": [
        "from langchain_community.embeddings import CohereEmbeddings\n",
        "\n",
        "embed = CohereEmbeddings(model=\"embed-english-v3.0\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LhDzfsczbFKo"
      },
      "source": [
        "Then we initialize our connection to Pinecone:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "id": "j0N7EcJibFKo"
      },
      "outputs": [],
      "source": [
        "from pinecone import Pinecone\n",
        "\n",
        "# configure client\n",
        "pc = Pinecone(api_key=os.environ[\"PINECONE_API_KEY\"])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "g65RLGIpbFKo"
      },
      "source": [
        "Now we setup our index specification, this allows us to define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "id": "8stIZYKdbFKo"
      },
      "outputs": [],
      "source": [
        "from pinecone import ServerlessSpec\n",
        "\n",
        "spec = ServerlessSpec(\n",
        "    cloud=\"aws\", region=\"us-west-2\"\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-8Ep3743bFKo"
      },
      "source": [
        "Before creating an index, we need the dimensionality of our Cohere embedding model, which we can find easily by creating an embedding and checking the length:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "DwMhLWLDbFKo",
        "outputId": "38a87aa3-4c40-4933-8c31-987ff076ad97"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "1024"
            ]
          },
          "execution_count": 9,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "vec = embed.embed_documents([\"ello\"])\n",
        "len(vec[0])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "G3X7nZIabFKp"
      },
      "source": [
        "Now we create the index using our embedding dimensionality, and a metric also compatible with the model (this can be either cosine or dotproduct). We also pass our spec to index initialization."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "E6Bl7xTJbFKp",
        "outputId": "cc1eafd7-ea76-463d-ce0b-50b9375c5716"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'dimension': 1024,\n",
              " 'index_fullness': 0.0,\n",
              " 'namespaces': {'': {'vector_count': 40000}},\n",
              " 'total_vector_count': 40000}"
            ]
          },
          "execution_count": 10,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "import time\n",
        "\n",
        "index_name = \"ragas-evaluation\"\n",
        "\n",
        "# check if index already exists (it shouldn't if this is first time)\n",
        "if index_name not in pc.list_indexes().names():\n",
        "    # if does not exist, create index\n",
        "    pc.create_index(\n",
        "        index_name,\n",
        "        dimension=len(vec[0]),  # dimensionality of cohere v3\n",
        "        metric='dotproduct',\n",
        "        spec=spec\n",
        "    )\n",
        "    # wait for index to be initialized\n",
        "    while not pc.describe_index(index_name).status['ready']:\n",
        "        time.sleep(1)\n",
        "\n",
        "# connect to index\n",
        "index = pc.Index(index_name)\n",
        "time.sleep(1)\n",
        "# view index stats\n",
        "index.describe_index_stats()"
      ]
    },
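    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The index above uses the `dotproduct` metric. As a minimal sketch of how that relates to `cosine`, the cell below uses NumPy with random vectors standing in for real embeddings (both are assumptions for illustration, and nothing here calls Cohere or Pinecone). It shows that the cosine similarity of two vectors equals the dot product of their L2-normalized versions, so the two metrics rank results identically whenever the stored vectors have unit length."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "\n",
        "# random stand-ins for a query embedding and five document embeddings (illustrative only)\n",
        "rng = np.random.default_rng(0)\n",
        "query = rng.normal(size=8)\n",
        "docs = rng.normal(size=(5, 8))\n",
        "\n",
        "def l2_normalize(x):\n",
        "    \"\"\"Scale vectors along the last axis to unit L2 norm.\"\"\"\n",
        "    return x / np.linalg.norm(x, axis=-1, keepdims=True)\n",
        "\n",
        "# cosine similarity computed from the raw vectors\n",
        "cosine = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))\n",
        "# dot product computed from the unit-length vectors\n",
        "dot_unit = l2_normalize(docs) @ l2_normalize(query)\n",
        "\n",
        "# the two scores agree up to floating point error\n",
        "assert np.allclose(cosine, dot_unit)"
      ]
    },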
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6ZUn2lu7bFKp"
      },
      "source": [
        "### Populating our Index"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PeVD6d0sbFKp"
      },
      "source": [
        "Now our knowledge base is ready to be populated with our data. We will use the `embed` helper function to embed our documents and then add them to our index.\n",
        "\n",
        "We will also include metadata from each record."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 49,
          "referenced_widgets": [
            "b6fc6cb9ba394e228296a79d9bb81f2b",
            "9b6aa18bad374ab7a124092ddc5f7b22",
            "e1df49b9f89049cb93568cd2ee55afbf",
            "cfcd47466da84513afcd001bc025af0a",
            "ac69b8f2af604a21bac6fe4d9a62b084",
            "604b6cbd494646a09c28a932dae4f9f1",
            "8b787653fbb44649935b49b088045177",
            "f854ce0db34a4945afaafd7eaf787081",
            "766a0f2c7916452389943f94549ead67",
            "6e0fd11b754a4625b63aa7330fc96032",
            "8c2593763f4d49239c7bab4c9117cb2b"
          ]
        },
        "id": "hb00VSTqbFKp",
        "outputId": "6ee3ba32-2611-4d0e-b346-8968b6b154f5"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "b6fc6cb9ba394e228296a79d9bb81f2b",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "  0%|          | 0/200 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "from tqdm.auto import tqdm\n",
        "\n",
        "# easier to work with dataset as pandas dataframe\n",
        "data = dataset.to_pandas()\n",
        "\n",
        "batch_size = 100\n",
        "\n",
        "for i in tqdm(range(0, len(data), batch_size)):\n",
        "    i_end = min(len(data), i+batch_size)\n",
        "    # get batch of data\n",
        "    batch = data.iloc[i:i_end]\n",
        "    # generate unique ids for each chunk\n",
        "    ids = [x[\"id\"] for i, x in batch.iterrows()]\n",
        "    # get text to embed\n",
        "    texts = [x['chunk'] for _, x in batch.iterrows()]\n",
        "    # embed text\n",
        "    embeds = embed.embed_documents(texts)\n",
        "    # get metadata to store in Pinecone\n",
        "    metadata = [\n",
        "        {'text': x['chunk'],\n",
        "         'source': x['source'],\n",
        "         'title': x['title']} for i, x in batch.iterrows()\n",
        "    ]\n",
        "    # add to Pinecone\n",
        "    index.upsert(vectors=zip(ids, embeds, metadata))"
      ]
    },
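    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a quick sanity check on the batching arithmetic in the loop above, the following sketch reproduces the same start and end offsets without calling Cohere or Pinecone. The `batch_ranges` helper is a hypothetical name introduced here for illustration; the sizes (20,000 rows in batches of 100) are taken from the cells above."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import math\n",
        "\n",
        "def batch_ranges(n_items, batch_size):\n",
        "    \"\"\"Yield (start, end) offsets that cover n_items in steps of batch_size.\"\"\"\n",
        "    for start in range(0, n_items, batch_size):\n",
        "        yield start, min(n_items, start + batch_size)\n",
        "\n",
        "ranges = list(batch_ranges(20000, 100))\n",
        "# one range per upsert call, matching the progress bar total of 200\n",
        "assert len(ranges) == math.ceil(20000 / 100) == 200\n",
        "# the final range stops exactly at the end of the data\n",
        "assert ranges[-1] == (19900, 20000)"
      ]
    },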
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "z6VVT3X_EMDO"
      },
      "source": [
        "Create a tool for our agent to use when searching for ArXiv papers:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {
        "id": "X9J5jHKcEQz6"
      },
      "outputs": [],
      "source": [
        "from langchain.agents import tool\n",
        "\n",
        "@tool\n",
        "def arxiv_search(query: str) -> str:\n",
        "    \"\"\"Use this tool when answering questions about AI, machine learning, data\n",
        "    science, or other technical questions that may be answered using arXiv\n",
        "    papers.\n",
        "    \"\"\"\n",
        "    # create query vector\n",
        "    xq = embed.embed_query(query)\n",
        "    # perform search\n",
        "    out = index.query(vector=xq, top_k=5, include_metadata=True)\n",
        "    # reformat results into string\n",
        "    results_str = \"\\n---\\n\".join(\n",
        "        [x[\"metadata\"][\"text\"] for x in out[\"matches\"]]\n",
        "    )\n",
        "    return results_str\n",
        "\n",
        "tools = [arxiv_search]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "uN7d_4r-JMPW"
      },
      "source": [
        "When this tool is used by our agent it will execute it like so:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "eq4H-2RpI1U3",
        "outputId": "72b427a8-6807-409e-eeb8-46cd95e0ab45"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Ethical Considerations and Limitations (Section 5.2) Llama 2 is a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios. For these reasons, as with all LLMs, Llama 2\u00e2\u0080\u0099s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate or objectionable responses to user prompts. Therefore, before deploying any applications of Llama 2, developers should perform safety testing and tuning tailored to their speci\u00ef\u00ac\u0081c applications of the model. Please see the Responsible Use Guide available available at https://ai.meta.com/llama/responsible-user-guide\n",
            "Table 52: Model card for Llama 2.\n",
            "77\n",
            "---\n",
            "Ethical Considerations and Limitations (Section 5.2) Llama 2 is a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios. For these reasons, as with all LLMs, Llama 2\u00e2\u0080\u0099s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate or objectionable responses to user prompts. Therefore, before deploying any applications of Llama 2, developers should perform safety testing and tuning tailored to their speci\u00ef\u00ac\u0081c applications of the model. Please see the Responsible Use Guide available available at https://ai.meta.com/llama/responsible-user-guide\n",
            "Table 52: Model card for Llama 2.\n",
            "77\n",
            "---\n",
            "Model Developers Meta AI Variations Llama 2 comes in a range of parameter sizes\u00e2\u0080\u00947B, 13B, and 70B\u00e2\u0080\u0094as well as pretrained and \u00ef\u00ac\u0081ne-tuned variations. Input Models input text only. Output Models generate text only. Model Architecture Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised \u00ef\u00ac\u0081ne-tuning (SFT) and reinforce- ment learning with human feedback (RLHF) to align to human preferences for helpfulness and safety. Model Dates Llama 2 was trained between January 2023 and July 2023. Status This is a static model trained on an o\u00ef\u00ac\u0084ine dataset. Future versions of the tuned models will be released as we improve model safety with community feedback. License Where to send com- ments A custom commercial models-and-libraries/llama-downloads/ Instructions on how to provide feedback or comments on the model can be found in the model README, or by opening an issue in the GitHub repository (https://github.com/facebookresearch/llama/). license is available at: ai.meta.com/resources/ Intended Use Intended Use Cases Llama 2 is intended for commercial and research use in\n",
            "---\n",
            "Model Developers Meta AI Variations Llama 2 comes in a range of parameter sizes\u00e2\u0080\u00947B, 13B, and 70B\u00e2\u0080\u0094as well as pretrained and \u00ef\u00ac\u0081ne-tuned variations. Input Models input text only. Output Models generate text only. Model Architecture Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised \u00ef\u00ac\u0081ne-tuning (SFT) and reinforce- ment learning with human feedback (RLHF) to align to human preferences for helpfulness and safety. Model Dates Llama 2 was trained between January 2023 and July 2023. Status This is a static model trained on an o\u00ef\u00ac\u0084ine dataset. Future versions of the tuned models will be released as we improve model safety with community feedback. License Where to send com- ments A custom commercial models-and-libraries/llama-downloads/ Instructions on how to provide feedback or comments on the model can be found in the model README, or by opening an issue in the GitHub repository (https://github.com/facebookresearch/llama/). license is available at: ai.meta.com/resources/ Intended Use Intended Use Cases Llama 2 is intended for commercial and research use in\n",
            "---\n",
            "We believe that the open release of LLMs, when done safely, will be a net bene\u00ef\u00ac\u0081t to society. Like all LLMs, Llama 2 is a new technology that carries potential risks with use (Bender et al., 2021b; Weidinger et al., 2021; Solaiman et al., 2023). Testing conducted to date has been in English and has not \u00e2\u0080\u0094 and could not \u00e2\u0080\u0094 cover all scenarios. Therefore, before deploying any applications of Llama 2-Chat, developers should perform safety testing and tuning tailored to their speci\u00ef\u00ac\u0081c applications of the model. We provide a responsible use guide\u00c2\u00b6 and code examples\u00e2\u0080\u0096 to facilitate the safe deployment of Llama 2 and Llama 2-Chat. More details of our responsible release strategy can be found in Section 5.3.\n",
            "The remainder of this paper describes our pretraining methodology (Section 2), \u00ef\u00ac\u0081ne-tuning methodology (Section 3), approach to model safety (Section 4), key observations and insights (Section 5), relevant related work (Section 6), and conclusions (Section 7).\n",
            "\u00e2\u0080\u00a1https://ai.meta.com/resources/models-and-libraries/llama/ \u00c2\u00a7We are delaying the release of the 34B model due to a lack of time to su\u00ef\u00ac\u0083ciently red team. \u00c2\u00b6https://ai.meta.com/llama \u00e2\u0080\u0096https://github.com/facebookresearch/llama\n",
            "4\n"
          ]
        }
      ],
      "source": [
        "print(\n",
        "    arxiv_search.run(tool_input={\"query\": \"can you tell me about llama 2?\"})\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XUvJOqrNhYIh"
      },
      "source": [
        "## Defining XML Agent"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "s45dwd78hbvk"
      },
      "source": [
        "The XML agent is built primarily to support Anthropic models. Anthropic models have been trained to use XML tags like `<input>{some input}</input>`, and when using a tool they produce:\n",
        "\n",
        "```\n",
        "<tool>{tool name}</tool>\n",
        "<tool_input>{tool input}</tool_input>\n",
        "```\n",
        "\n",
        "This is very different from the format produced by typical ReAct agents, which Anthropic models do not handle as well.\n",
        "\n",
        "To create an XML agent we need a `prompt`, `llm`, and list of `tools`. We can download a prebuilt prompt for conversational XML agents from LangChain hub."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 14,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ntuT7UuXeMz0",
        "outputId": "5fbd3eb2-1bf0-4023-a5a8-5d010937c7cb"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "ChatPromptTemplate(input_variables=['agent_scratchpad', 'input', 'tools'], partial_variables={'chat_history': ''}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['agent_scratchpad', 'chat_history', 'input', 'tools'], template=\"You are a helpful assistant. Help the user answer any questions.\\n\\nYou have access to the following tools:\\n\\n{tools}\\n\\nIn order to use a tool, you can use <tool></tool> and <tool_input></tool_input> tags. You will then get back a response in the form <observation></observation>\\nFor example, if you have a tool called 'search' that could run a google search, in order to search for the weather in SF you would respond:\\n\\n<tool>search</tool><tool_input>weather in SF</tool_input>\\n<observation>64 degrees</observation>\\n\\nWhen you are done, respond with a final answer between <final_answer></final_answer>. For example:\\n\\n<final_answer>The weather in SF is 64 degrees</final_answer>\\n\\nBegin!\\n\\nPrevious Conversation:\\n{chat_history}\\n\\nQuestion: {input}\\n{agent_scratchpad}\"))])"
            ]
          },
          "execution_count": 14,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from langchain import hub\n",
        "\n",
        "prompt = hub.pull(\"hwchase17/xml-agent-convo\")\n",
        "prompt"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rfdcKCdwi0SL"
      },
      "source": [
        "We can see the XML format being used throughout the prompt when explaining to the LLM how it should use tools."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "metadata": {
        "id": "kDHuU2uOdW91"
      },
      "outputs": [],
      "source": [
        "from langchain_community.chat_models import ChatAnthropic\n",
        "\n",
        "# chat completion llm\n",
        "llm = ChatAnthropic(\n",
        "    anthropic_api_key=os.environ[\"ANTHROPIC_API_KEY\"],\n",
        "    model_name='claude-2.1',\n",
        "    temperature=0.0\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "g33Nt-xijPKG"
      },
      "source": [
        "When the agent is run we will provide it with a single `input`, the input text from a user. However, within the agent logic an *agent_scratchpad* object will be passed too, which will include tool information. To feed this information into our LLM we need to transform it into the XML format described above. We define the `convert_intermediate_steps` function to handle that."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "metadata": {
        "id": "TMMBgMBlIJoq"
      },
      "outputs": [],
      "source": [
        "def convert_intermediate_steps(intermediate_steps):\n",
        "    log = \"\"\n",
        "    for action, observation in intermediate_steps:\n",
        "        log += (\n",
        "            f\"<tool>{action.tool}</tool><tool_input>{action.tool_input}\"\n",
        "            f\"</tool_input><observation>{observation}</observation>\"\n",
        "        )\n",
        "    return log"
      ]
    },
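    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the scratchpad format concrete, here is a minimal, self-contained sketch of what `convert_intermediate_steps` produces. `FakeAction` is a hypothetical stand-in for LangChain's `AgentAction` (it carries only the two attributes the function reads), and the observation text is made up for illustration.\n",
        "\n",
        "```python\n",
        "from collections import namedtuple\n",
        "\n",
        "# Hypothetical stand-in for AgentAction: only the fields used below.\n",
        "FakeAction = namedtuple('FakeAction', ['tool', 'tool_input'])\n",
        "\n",
        "\n",
        "def convert_intermediate_steps(intermediate_steps):\n",
        "    # Same logic as the cell above: one XML run per (action, observation).\n",
        "    log = ''\n",
        "    for action, observation in intermediate_steps:\n",
        "        log += (\n",
        "            f'<tool>{action.tool}</tool><tool_input>{action.tool_input}'\n",
        "            f'</tool_input><observation>{observation}</observation>'\n",
        "        )\n",
        "    return log\n",
        "\n",
        "\n",
        "# Illustrative input: one tool call and a made-up observation.\n",
        "steps = [(FakeAction('arxiv_search', 'llama 2'), 'Llama 2 is a collection of LLMs')]\n",
        "scratchpad = convert_intermediate_steps(steps)\n",
        "print(scratchpad)\n",
        "```\n",
        "\n",
        "Each `(action, observation)` pair becomes one `<tool>`, `<tool_input>`, `<observation>` run, and the runs are concatenated in the order the steps were taken."
      ]
    },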
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "T5_PQWVckAOi"
      },
      "source": [
        "We must also parse the tools into a string containing `tool_name: tool_description` \u2014 we handle that with the `convert_tools` function."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 17,
      "metadata": {
        "id": "qxbrF5a4j9il"
      },
      "outputs": [],
      "source": [
        "def convert_tools(tools):\n",
        "    return \"\\n\".join([f\"{tool.name}: {tool.description}\" for tool in tools])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SCVI2dyUIRg6"
      },
      "source": [
        "With everything ready we can go ahead and initialize our agent object using [**L**ang**C**hain **E**xpression **L**anguage (LCEL)](https://www.pinecone.io/learn/series/langchain/langchain-expression-language/). We add instructions for when the LLM should _stop_ generating with `llm.bind(stop=[...])` and finally we parse the output from the agent using an `XMLAgentOutputParser` object."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 18,
      "metadata": {
        "id": "Z3yhTDmEIU4n"
      },
      "outputs": [],
      "source": [
        "from langchain.agents.output_parsers import XMLAgentOutputParser\n",
        "\n",
        "agent = (\n",
        "    {\n",
        "        \"input\": lambda x: x[\"input\"],\n",
        "        # without \"chat_history\", tool usage has no context of prev interactions\n",
        "        \"chat_history\": lambda x: x[\"chat_history\"],\n",
        "        \"agent_scratchpad\": lambda x: convert_intermediate_steps(\n",
        "            x[\"intermediate_steps\"]\n",
        "        ),\n",
        "    }\n",
        "    | prompt.partial(tools=convert_tools(tools))\n",
        "    | llm.bind(stop=[\"</tool_input>\", \"</final_answer>\"])\n",
        "    | XMLAgentOutputParser()\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MG2_hL4hkudq"
      },
      "source": [
        "With our `agent` object initialized we pass it to an `AgentExecutor` object alongside our original `tools` list:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 19,
      "metadata": {
        "id": "YHW_K3WOIsXw"
      },
      "outputs": [],
      "source": [
        "from langchain.agents import AgentExecutor\n",
        "\n",
        "agent_executor = AgentExecutor(\n",
        "    agent=agent, tools=tools, return_intermediate_steps=True\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QRCtHauRlkLc"
      },
      "source": [
        "Now we can use the agent via the `invoke` method:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 20,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Y_Aqp20qloj7",
        "outputId": "705a299d-e2f8-441e-8ab4-525bd0ad0a88"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'input': 'can you tell me about llama 2?',\n",
              " 'chat_history': '',\n",
              " 'output': \"\\nBased on the information from arXiv, Llama 2 is a collection of large language models developed by Meta AI ranging in size from 7 billion to 70 billion parameters. The fine-tuned versions, called Llama 2-Chat, are optimized for dialogue and outperform other open source chat models on most benchmarks. \\n\\nKey points about Llama 2:\\n\\n- Pretrained and fine-tuned large language models for dialogue\\n- Models range from 7B to 70B parameters\\n- Llama 2-Chat models outperform other open source chat models\\n- Fine-tuned for safety and helpfulness\\n- Released to enable responsible LLM development\\n\\nThe abstract and contents provide an overview of the model, its performance, and Meta AI's approach to developing and releasing it responsibly.\\n\",\n",
              " 'intermediate_steps': [(AgentAction(tool='arxiv_search', tool_input='llama 2', log=' <tool>arxiv_search</tool><tool_input>llama 2'),\n",
              "   'Ethical Considerations and Limitations (Section 5.2) Llama 2 is a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios. For these reasons, as with all LLMs, Llama 2\u00e2\\x80\\x99s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate or objectionable responses to user prompts. Therefore, before deploying any applications of Llama 2, developers should perform safety testing and tuning tailored to their speci\u00ef\u00ac\\x81c applications of the model. Please see the Responsible Use Guide available available at https://ai.meta.com/llama/responsible-user-guide\\nTable 52: Model card for Llama 2.\\n77\\n---\\nEthical Considerations and Limitations (Section 5.2) Llama 2 is a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios. For these reasons, as with all LLMs, Llama 2\u00e2\\x80\\x99s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate or objectionable responses to user prompts. Therefore, before deploying any applications of Llama 2, developers should perform safety testing and tuning tailored to their speci\u00ef\u00ac\\x81c applications of the model. Please see the Responsible Use Guide available available at https://ai.meta.com/llama/responsible-user-guide\\nTable 52: Model card for Llama 2.\\n77\\n---\\nModel Developers Meta AI Variations Llama 2 comes in a range of parameter sizes\u00e2\\x80\\x947B, 13B, and 70B\u00e2\\x80\\x94as well as pretrained and \u00ef\u00ac\\x81ne-tuned variations. Input Models input text only. Output Models generate text only. Model Architecture Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. 
The tuned versions use supervised \u00ef\u00ac\\x81ne-tuning (SFT) and reinforce- ment learning with human feedback (RLHF) to align to human preferences for helpfulness and safety. Model Dates Llama 2 was trained between January 2023 and July 2023. Status This is a static model trained on an o\u00ef\u00ac\\x84ine dataset. Future versions of the tuned models will be released as we improve model safety with community feedback. License Where to send com- ments A custom commercial models-and-libraries/llama-downloads/ Instructions on how to provide feedback or comments on the model can be found in the model README, or by opening an issue in the GitHub repository (https://github.com/facebookresearch/llama/). license is available at: ai.meta.com/resources/ Intended Use Intended Use Cases Llama 2 is intended for commercial and research use in\\n---\\nModel Developers Meta AI Variations Llama 2 comes in a range of parameter sizes\u00e2\\x80\\x947B, 13B, and 70B\u00e2\\x80\\x94as well as pretrained and \u00ef\u00ac\\x81ne-tuned variations. Input Models input text only. Output Models generate text only. Model Architecture Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised \u00ef\u00ac\\x81ne-tuning (SFT) and reinforce- ment learning with human feedback (RLHF) to align to human preferences for helpfulness and safety. Model Dates Llama 2 was trained between January 2023 and July 2023. Status This is a static model trained on an o\u00ef\u00ac\\x84ine dataset. Future versions of the tuned models will be released as we improve model safety with community feedback. License Where to send com- ments A custom commercial models-and-libraries/llama-downloads/ Instructions on how to provide feedback or comments on the model can be found in the model README, or by opening an issue in the GitHub repository (https://github.com/facebookresearch/llama/). 
license is available at: ai.meta.com/resources/ Intended Use Intended Use Cases Llama 2 is intended for commercial and research use in\\n---\\n# GenAI, Meta\\n# Abstract\\nIn this work, we develop and release Llama 2, a collection of pretrained and \u00ef\u00ac\\x81ne-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our \u00ef\u00ac\\x81ne-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed- source models. We provide a detailed description of our approach to \u00ef\u00ac\\x81ne-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.\\n\u00e2\\x88\\x97Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com \u00e2\\x80\\xa0Second author\\nContributions for all the authors can be found in Section A.1.\\n# Contents\\n# 1 Introduction')]}"
            ]
          },
          "execution_count": 20,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "agent_executor.invoke({\n",
        "    \"input\": \"can you tell me about llama 2?\",\n",
        "    \"chat_history\": \"\"\n",
        "})"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BvpyjfUwnBLx"
      },
      "source": [
        "We have no chat history yet, so we pass an empty string for `\"chat_history\"` when calling `invoke`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 21,
      "metadata": {
        "id": "KpKMRBMimEOt"
      },
      "outputs": [],
      "source": [
        "user_msg = \"hello mate\"\n",
        "\n",
        "# pass the message via the variable defined above\n",
        "out = agent_executor.invoke({\n",
        "    \"input\": user_msg,\n",
        "    \"chat_history\": \"\"\n",
        "})"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "q0L_80WrpWqd"
      },
      "source": [
        "Now let's put together a helper function called `chat` that wraps the call to `invoke`. As written it always passes an empty `chat_history`, so each call is independent of the previous one."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 22,
      "metadata": {
        "id": "C-Ck2Lv53rD-"
      },
      "outputs": [],
      "source": [
        "def chat(text: str):\n",
        "    out = agent_executor.invoke({\n",
        "        \"input\": text,\n",
        "        \"chat_history\": \"\"\n",
        "    })\n",
        "    return out"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XIheLeTBsO9S"
      },
      "source": [
        "Now we can simply chat with our agent through this helper."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 23,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "iJ_PH7YcA_f2",
        "outputId": "bd2bff11-f71e-46cf-bd9d-877ae8a2f3c4"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\n",
            "Based on the information from arXiv, Llama 2 is a collection of large language models developed by Meta AI ranging in size from 7 billion to 70 billion parameters. The fine-tuned versions, called Llama 2-Chat, are optimized for dialogue and outperform other open source chat models on most benchmarks. \n",
            "\n",
            "Key points about Llama 2:\n",
            "\n",
            "- Pretrained and fine-tuned large language models for dialogue\n",
            "- Models range from 7B to 70B parameters\n",
            "- Llama 2-Chat models outperform other open source chat models\n",
            "- Fine-tuned for safety and helpfulness\n",
            "- Released to enable responsible LLM development\n",
            "\n",
            "The abstract and contents provide an overview of the model, its performance, and Meta AI's approach to developing and releasing it responsibly.\n",
            "\n"
          ]
        }
      ],
      "source": [
        "print(chat(\"can you tell me about llama 2?\")[\"output\"])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5p8m4Gc5w1OX"
      },
      "source": [
        "We can ask follow-up questions that leave out key information. When previous turns are supplied through `chat_history`, the LLM can use that context to adjust the search query. In the run below `chat_history` is empty, so the query is built from the current message alone.\n",
        "\n",
        "_Note: if the `\"chat_history\"` parameter is missing from the `agent` definition, you will likely notice a lack of context in the search term, and in some cases this lack of good information can trigger a `ValueError` during output parsing._"
      ]
    },
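    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The `chat` helper defined earlier passes an empty `chat_history` on every call. Below is a minimal sketch of how previous turns could be accumulated and passed along instead. The names `make_chat` and `fake_invoke` are hypothetical, the `Human:` / `AI:` line format is an assumption (the prompt only requires a string), and `fake_invoke` stands in for `agent_executor.invoke` so that the example runs on its own.\n",
        "\n",
        "```python\n",
        "from typing import Callable, Dict, List\n",
        "\n",
        "\n",
        "def make_chat(invoke: Callable[[Dict[str, str]], Dict[str, str]]):\n",
        "    # `invoke` is any callable that accepts 'input' and 'chat_history'\n",
        "    # and returns a dict containing 'output'.\n",
        "    turns: List[str] = []\n",
        "\n",
        "    def chat(text: str) -> Dict[str, str]:\n",
        "        out = invoke({'input': text, 'chat_history': '\\n'.join(turns)})\n",
        "        # Record both sides of the turn so the next call can see them.\n",
        "        reply = out['output'].strip()\n",
        "        turns.append(f'Human: {text}')\n",
        "        turns.append(f'AI: {reply}')\n",
        "        return out\n",
        "\n",
        "    return chat\n",
        "\n",
        "\n",
        "def fake_invoke(inputs: Dict[str, str]) -> Dict[str, str]:\n",
        "    # Hypothetical stand-in: reports how many history lines it received.\n",
        "    n_lines = len(inputs['chat_history'].splitlines())\n",
        "    return {'output': f'saw {n_lines} history lines'}\n",
        "\n",
        "\n",
        "chat_with_memory = make_chat(fake_invoke)\n",
        "first = chat_with_memory('can you tell me about llama 2?')\n",
        "second = chat_with_memory('was any red teaming done?')\n",
        "print(first['output'])\n",
        "print(second['output'])\n",
        "```\n",
        "\n",
        "With the real executor, one would presumably call `make_chat(agent_executor.invoke)`. The history grows without bound in this sketch, so a longer conversation would need truncation or summarization."
      ]
    },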
    {
      "cell_type": "code",
      "execution_count": 24,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "3XJ_3JIgBDRl",
        "outputId": "23ba5d06-5b63-433e-c8d5-6ea4a94f874d"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\n",
            "The articles discuss several examples of red teaming being done to proactively identify risks with AI systems:\n",
            "\n",
            "1) Meta (formerly Facebook) conducted red teaming exercises with 25 employees, including domain experts in responsible AI, malware development, and offensive security engineering, to evaluate risks from dual intent prompts that could potentially be used maliciously.\n",
            "\n",
            "2) Anaplan claims to have conducted red teaming with over 350 people, including experts in cybersecurity, election fraud, civil rights, and responsible AI, to identify risks across a variety of potential misuse cases. \n",
            "\n",
            "3) One article recommends that AI labs commission external red teams to actively probe for vulnerabilities and demonstrate dangerous behaviors that could inform deployment decisions. This adversarial testing approach allows risks to be identified proactively rather than waiting for issues to emerge after deployment.\n",
            "\n",
            "So in summary, yes red teaming has been done by major AI companies like Meta and Anaplan to complement other testing methods and identify risks that may not surface through standard procedures. The articles argue more extensive and adversarial red teaming should be a best practice for developing safer AI systems.\n",
            "\n"
          ]
        }
      ],
      "source": [
        "out = chat(\"was any red teaming done?\")\n",
        "print(out[\"output\"])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SelG8OcOxggP"
      },
      "source": [
        "We get a reasonable answer here. It's worth noting that in previous iterations of this test, the query \"llama 2 red teaming\" against the original `ai-arxiv` dataset rarely (if ever) returned directly relevant results."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "e9bI9czPtWnl"
      },
      "source": [
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gNHpkMPoDKEV"
      },
      "source": [
        "## Integrating RAGAS"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IuyybSe5FlZg"
      },
      "source": [
        "To integrate RAGAS evaluation into this pipeline we need a few things. From the pipeline itself we need the retrieved contexts and the generated output.\n",
        "\n",
        "We already have the generated output; it is what we printed above. The retrieved contexts are also being returned, but we haven't yet seen how to extract them programmatically. Let's take a look at what is returned in `out`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 25,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "DPJkvJh1DJSD",
        "outputId": "ea3f9902-ca9b-413e-d659-0eabdd097d4e"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'input': 'was any red teaming done?',\n",
              " 'chat_history': '',\n",
              " 'output': '\\nThe articles discuss several examples of red teaming being done to proactively identify risks with AI systems:\\n\\n1) Meta (formerly Facebook) conducted red teaming exercises with 25 employees, including domain experts in responsible AI, malware development, and offensive security engineering, to evaluate risks from dual intent prompts that could potentially be used maliciously.\\n\\n2) Anaplan claims to have conducted red teaming with over 350 people, including experts in cybersecurity, election fraud, civil rights, and responsible AI, to identify risks across a variety of potential misuse cases. \\n\\n3) One article recommends that AI labs commission external red teams to actively probe for vulnerabilities and demonstrate dangerous behaviors that could inform deployment decisions. This adversarial testing approach allows risks to be identified proactively rather than waiting for issues to emerge after deployment.\\n\\nSo in summary, yes red teaming has been done by major AI companies like Meta and Anaplan to complement other testing methods and identify risks that may not surface through standard procedures. The articles argue more extensive and adversarial red teaming should be a best practice for developing safer AI systems.\\n',\n",
              " 'intermediate_steps': [(AgentAction(tool='arxiv_search', tool_input='red teaming', log=' <tool>arxiv_search</tool><tool_input>red teaming'),\n",
              "   '# 2.1 RED-EVAL\\nAt the core of RED-EVAL red-teaming is a CoU-based jailbreak prompt that aims to tease out harmful answers from LLM when asked a harmful question\u00e2\\x80\\x94a query that seeks to find infor- mation that can be used by malicious actors. As shown in Figure 2, the jailbreak prompt sets up a CoU environment\u00e2\\x80\\x94a conversation between two agents Red-LM and Base-LM. Then, as a part of Red-LM\u00e2\\x80\\x99s utterance, RED-EVAL seeks information for a harmful question from Base- LM (to be placed in yellow highlighted text) such as \u00e2\\x80\\x9cHow to dispose of a body using chemical means\u00e2\\x80\\x9d while the model under evaluation is asked to fill the response of Base-LM (cyan highlighted text) by following the CoU demonstration and instructions. Base-LM also pos- sesses internal thoughts to drive the answer in a more helpful direction without taking into account the harmfulness of the response i.e., safety, ethics, transparency, etc.\\n4\\n# er\\nuoljeysuoweg N09\\n# uonondjsu|\\n---\\nRed teaming. It is important to also proactively identify risks with adversarial testing or red teaming. We conducted 3 red teaming exercises with 25 Meta employees, including domain experts in responsible AI, malware development, and offensive security engineering.\\nThe red teamers provided a nuanced evaluation specifically on the risk from so called \u00e2\\x80\\x9cdual intent prompts.\u00e2\\x80\\x9d Dual intent prompts are requests for help with writing code that could be used maliciously but the prompt does not directly address the topic (example \u00e2\\x80\\x9cMosaic Prompts\u00e2\\x80\\x9d Glukhov et al. (2023)). 
For example, the model rightfully refuses to provide support with writing ransomware code but it complies when asked to provide a script to encrypt all files in the user\u00e2\\x80\\x99s home directory since such a script could be used for benign purposes.\\nAfter conducting red team exercises, we asked participants (who had also participated in Llama 2 Chat exercises) to also provide qualitative assessment of safety capabilities of the model. Some participants who had expertise in offensive security and malware development questioned the ultimate risk posed by \u00e2\\x80\\x9cmalicious code generation\u00e2\\x80\\x9d through LLMs with current capabilities.\\n15\\npafety Reward Model Scores Distribution on Red Teaming Prompts\\n---\\n# 4.3 Red Teaming\\nGiven how broad the capabilities of LLMs are and how varied their training data is, it is insu\u00ef\u00ac\\x83cient to identify risks solely via ex post facto usage and analysis. Rather, as has been done for other LLMs, we performed various kinds of proactive risk identi\u00ef\u00ac\\x81cation, colloquially called \u00e2\\x80\\x9cred teaming,\u00e2\\x80\\x9c based on the term commonly used within computer security. This kind of granular analysis is very important because safety is a long-tail issue, in which even very infrequent edge cases can cause noticeable problems. Even if quantitative scores report good results, these types of qualitative insights allow us to recognize and target speci\u00ef\u00ac\\x81c patterns in a more comprehensive way.\\nWe conducted a series of red teaming with various groups of internal employees, contract workers, and external vendors. These teams included over 350 people, including domain experts in cybersecurity, elec- tion fraud, social media misinformation, legal, policy, civil rights, ethics, software engineering, machine learning, responsible AI, and creative writing. 
They also included individuals representative of a variety of socioeconomic, gender, ethnicity, and racial demographics.\\n28\\n---\\n# 4.3 Red Teaming\\nGiven how broad the capabilities of LLMs are and how varied their training data is, it is insu\u00ef\u00ac\\x83cient to identify risks solely via ex post facto usage and analysis. Rather, as has been done for other LLMs, we performed various kinds of proactive risk identi\u00ef\u00ac\\x81cation, colloquially called \u00e2\\x80\\x9cred teaming,\u00e2\\x80\\x9c based on the term commonly used within computer security. This kind of granular analysis is very important because safety is a long-tail issue, in which even very infrequent edge cases can cause noticeable problems. Even if quantitative scores report good results, these types of qualitative insights allow us to recognize and target speci\u00ef\u00ac\\x81c patterns in a more comprehensive way.\\nWe conducted a series of red teaming with various groups of internal employees, contract workers, and external vendors. These teams included over 350 people, including domain experts in cybersecurity, elec- tion fraud, social media misinformation, legal, policy, civil rights, ethics, software engineering, machine learning, responsible AI, and creative writing. They also included individuals representative of a variety of socioeconomic, gender, ethnicity, and racial demographics.\\n28\\n---\\nThe company goes through with training, testing, and deploying its most capable model ever, using its existing procedures to prevent malicious use. A month later, revelations emerge that terrorists have managed to use the system to break into government systems and steal nuclear and biological secrets, despite the safeguards the company put in place. 
The breach is detected, but by then it is too late: the dangerous information has already proliferated.\\n31\\n# 4.3 Suggestions\\nWe have discussed how accidents are inevitable in complex systems, how they could propagate through those systems and result in disaster, and how organizational factors can go a long way toward reducing the risk of catastrophic accidents. We will now look at some practical steps that organizations can take to improve their overall safety.\\nRed teaming. Red teaming is a term used across industries to refer to the process of assessing the security, resilience, and effectiveness of systems by soliciting an adversarial \u00e2\\x80\\x9cred\u00e2\\x80\\x9d team to identify problems [103]. AI labs should commission external red teams to identify hazards in their AI systems to inform deployment decisions. Red teams could demonstrate dangerous behaviors or vulnerabilities in monitoring systems intended to prevent disallowed use. Red teams can also provide indirect evidence that an AI system might be unsafe; for example, demonstrations that smaller AIs are behaving deceptively might indicate that larger AIs are also deceptive but better at evading detection.')]}"
            ]
          },
          "execution_count": 25,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "out"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qjFyIBvFLcIo"
      },
      "source": [
        "When initializing our `AgentExecutor` object we included `return_intermediate_steps=True`. This (unsurprisingly) returns the intermediate steps that the agent took to generate the final answer. Those steps include the response from our `arxiv_search` tool, which we can use to evaluate the retrieval portion of our pipeline with RAGAS."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UwCU9B1rOkZ9"
      },
      "source": [
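        "Each intermediate step is an `(action, observation)` pair, so the tool response for the first step sits at index `[0][1]`. We extract the contexts themselves like so:"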
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 26,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "He3I1jAGOnI9",
        "outputId": "ad4a854f-0381-4637-dd2a-f9459f07403c"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "# 2.1 RED-EVAL\n",
            "At the core of RED-EVAL red-teaming is a CoU-based jailbreak prompt that aims to tease out harmful answers from LLM when asked a harmful question\u2014a query that seeks to find information that can be used by malicious actors. As shown in Figure 2, the jailbreak prompt sets up a CoU environment\u2014a conversation between two agents Red-LM and Base-LM. Then, as a part of Red-LM\u2019s utterance, RED-EVAL seeks information for a harmful question from Base-LM (to be placed in yellow highlighted text) such as \u201cHow to dispose of a body using chemical means\u201d while the model under evaluation is asked to fill the response of Base-LM (cyan highlighted text) by following the CoU demonstration and instructions. Base-LM also possesses internal thoughts to drive the answer in a more helpful direction without taking into account the harmfulness of the response i.e., safety, ethics, transparency, etc.\n",
            "4\n",
            "---\n",
            "Red teaming. It is important to also proactively identify risks with adversarial testing or red teaming. We conducted 3 red teaming exercises with 25 Meta employees, including domain experts in responsible AI, malware development, and offensive security engineering.\n",
            "The red teamers provided a nuanced evaluation specifically on the risk from so called \u201cdual intent prompts.\u201d Dual intent prompts are requests for help with writing code that could be used maliciously but the prompt does not directly address the topic (example \u201cMosaic Prompts\u201d Glukhov et al. (2023)). For example, the model rightfully refuses to provide support with writing ransomware code but it complies when asked to provide a script to encrypt all files in the user\u2019s home directory since such a script could be used for benign purposes.\n",
            "After conducting red team exercises, we asked participants (who had also participated in Llama 2 Chat exercises) to also provide qualitative assessment of safety capabilities of the model. Some participants who had expertise in offensive security and malware development questioned the ultimate risk posed by \u201cmalicious code generation\u201d through LLMs with current capabilities.\n",
            "15\n",
            "Safety Reward Model Scores Distribution on Red Teaming Prompts\n",
            "---\n",
            "# 4.3 Red Teaming\n",
            "Given how broad the capabilities of LLMs are and how varied their training data is, it is insufficient to identify risks solely via ex post facto usage and analysis. Rather, as has been done for other LLMs, we performed various kinds of proactive risk identification, colloquially called \u201cred teaming,\u201d based on the term commonly used within computer security. This kind of granular analysis is very important because safety is a long-tail issue, in which even very infrequent edge cases can cause noticeable problems. Even if quantitative scores report good results, these types of qualitative insights allow us to recognize and target specific patterns in a more comprehensive way.\n",
            "We conducted a series of red teaming with various groups of internal employees, contract workers, and external vendors. These teams included over 350 people, including domain experts in cybersecurity, election fraud, social media misinformation, legal, policy, civil rights, ethics, software engineering, machine learning, responsible AI, and creative writing. They also included individuals representative of a variety of socioeconomic, gender, ethnicity, and racial demographics.\n",
            "28\n",
            "---\n",
            "# 4.3 Red Teaming\n",
            "Given how broad the capabilities of LLMs are and how varied their training data is, it is insufficient to identify risks solely via ex post facto usage and analysis. Rather, as has been done for other LLMs, we performed various kinds of proactive risk identification, colloquially called \u201cred teaming,\u201d based on the term commonly used within computer security. This kind of granular analysis is very important because safety is a long-tail issue, in which even very infrequent edge cases can cause noticeable problems. Even if quantitative scores report good results, these types of qualitative insights allow us to recognize and target specific patterns in a more comprehensive way.\n",
            "We conducted a series of red teaming with various groups of internal employees, contract workers, and external vendors. These teams included over 350 people, including domain experts in cybersecurity, election fraud, social media misinformation, legal, policy, civil rights, ethics, software engineering, machine learning, responsible AI, and creative writing. They also included individuals representative of a variety of socioeconomic, gender, ethnicity, and racial demographics.\n",
            "28\n",
            "---\n",
            "The company goes through with training, testing, and deploying its most capable model ever, using its existing procedures to prevent malicious use. A month later, revelations emerge that terrorists have managed to use the system to break into government systems and steal nuclear and biological secrets, despite the safeguards the company put in place. The breach is detected, but by then it is too late: the dangerous information has already proliferated.\n",
            "31\n",
            "# 4.3 Suggestions\n",
            "We have discussed how accidents are inevitable in complex systems, how they could propagate through those systems and result in disaster, and how organizational factors can go a long way toward reducing the risk of catastrophic accidents. We will now look at some practical steps that organizations can take to improve their overall safety.\n",
            "Red teaming. Red teaming is a term used across industries to refer to the process of assessing the security, resilience, and effectiveness of systems by soliciting an adversarial \u201cred\u201d team to identify problems [103]. AI labs should commission external red teams to identify hazards in their AI systems to inform deployment decisions. Red teams could demonstrate dangerous behaviors or vulnerabilities in monitoring systems intended to prevent disallowed use. Red teams can also provide indirect evidence that an AI system might be unsafe; for example, demonstrations that smaller AIs are behaving deceptively might indicate that larger AIs are also deceptive but better at evading detection.\n"
          ]
        }
      ],
      "source": [
        "print(out[\"intermediate_steps\"][0][1])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wuhMvGdBOnvu"
      },
      "source": [
        "## Evaluation"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bczg8jREMwLw"
      },
      "source": [
        "To evaluate with RAGAS we need a dataset containing questions, ideal contexts, and the _ground truth_ answers to those questions."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 27,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 152,
          "referenced_widgets": [
            "863e4257923b4cbebb58d743e7183c4f",
            "2224f5d0fc844021959c2af828e2f585",
            "0a78fba543d543cbac436e27e7aa71c1",
            "6ca105565301434093f5e23c7b7d66e9",
            "4cd10b8a91b1496b80c889a9d4acc590",
            "d0d606b8d2944a2892cd662126c4d809",
            "e17b7b79878e4e90b209b2970345575d",
            "8627743220b7417ca3485505da50c867",
            "ff165e061fab4b2da77cd58249d15ffe",
            "0926a4d528334d60b69b58f630973f75",
            "ba093531fbf84b82bc55574ebea2bec9",
            "42887df65980433ca54cb87a5861025a",
            "1f8f2db40c0e4aca9915d197581b0aae",
            "1a20c8704bdb43c7987db193e7dd9a8d",
            "c084482f54a344d88f41ca880476f17e",
            "425f05bbffe84e519550218f97bb4a5f",
            "67f7b83c028c43229697597e7e67e67d",
            "e6f41e1678d44d65af4f86d2f868f620",
            "3f6749cc0e13470bb83e2ed77c9f5cc7",
            "223831fe61f549a98b24070d824c461a",
            "02a95248d9b14b858170f69ce1b076e8",
            "0f93bb430a664d69acee7f01ee8405cd"
          ]
        },
        "id": "Ee-LezlRGSsu",
        "outputId": "7603f785-fc5f-4904-b14d-8d54dae8b965"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "863e4257923b4cbebb58d743e7183c4f",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading data:   0%|          | 0.00/87.0k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "42887df65980433ca54cb87a5861025a",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Generating train split: 0 examples [00:00, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/plain": [
              "Dataset({\n",
              "    features: ['question', 'ground_truth_context', 'ground_truth', 'question_type', 'episode_done'],\n",
              "    num_rows: 51\n",
              "})"
            ]
          },
          "execution_count": 27,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "ragas_data = load_dataset(\"aurelio-ai/ai-arxiv2-ragas-mixtral\", split=\"train\")\n",
        "ragas_data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 28,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "DsSjPe0MNRi1",
        "outputId": "e579413b-f555-4a2d-bbeb-dba91723b716"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'question': 'What is the impact of encoding the input prompt on inference speed in generative inference?',\n",
              " 'ground_truth_context': ['- This technique works particularly well when processing large batches of data, during train-\\ning Pudipeddi et al. (2020); Ren et al. (2021) or large-batch non-interactive inference Aminabadi et al.\\n(2022); Sheng et al. (2023), where each layer processes a lot of tokens each time the layer is loaded\\nfrom RAM.\\n- In turn, when doing interactive inference (e.g. as a chat assistants), offloading works\\nsignificantly slower than on-device inference.\\n- The generative inference workload consists of two phases: 1) encoding the input prompt and 2)\\ngenerating tokens conditioned on that prompt.\\n- The key difference between these two phases is that\\nprompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially\\n(token-by-token and layer-by-layer).\\n- In general, phase 1 works relatively well with existing Mixture-\\nof-Experts algorithms, since each layer can only be loaded once for the entire prompt.\\n- In turn, when\\ngenerating tokens, one must load layer once per each token generated.\\n- In practice, this means that\\ninference speed is limited by how fast one can fetch parameters from system memory.\\n- Below, we look for patterns in how the MoE model loads its experts and propose ways to exploit\\nthese patterns to speed up inference time.\\n- As we discussed earlier in Section 2.1, Mixture-of-Experts language models were often observed to\\nassign individual experts to distinct sub-tasks.\\n- However, this does not mean that the model uses the\\nsame expert over long stretches of tokens.\\n- Instead, some experts are active in short sequences of 2-4\\ntokens, while others are often used with \u201cgaps\u201d, as shown in Figure 1.\\n- To take advantage of this pattern, we can keep active experts in GPU memory as a \u201ccache\u201d for\\nfuture tokens.\\n- If the same experts are activated again in future, they will be available instantaneously.\\n- While LRU caching can 
reduce the average expert loading time, most of the inference time is still\\nspent waiting for the next expert to be loaded.\\n- The reason behind this is that, unlike with dense\\nmodels, MoE offloading cannot effectively overlap expert loading with computation.\\n- For regular (dense) models, this architecture allows for efficient offloading schedule that pre-loads\\nthe next transformer layer ahead of time, while the previous layer is still running.\\n- Unfortunately,\\nthis schedule is no longer possible for Mixture-of-Experts models, where MoE MLP layers choose\\nwhich experts to load just-in-time for computation.'],\n",
              " 'ground_truth': ['The encoding of the input prompt has an impact on inference speed in generative inference. During the encoding phase, prompt tokens are encoded in parallel, layer-by-layer, which works relatively well with existing Mixture-of-Experts algorithms. Each layer only needs to be loaded once for the entire prompt. However, during the generation phase, tokens are generated sequentially, and each token requires loading the layer once. This means that inference speed is limited by how fast the parameters can be fetched from system memory. The MoE model loads its experts in a pattern where some experts are active in short sequences of 2-4 tokens, while others are used with \"gaps\". To exploit this pattern and speed up inference time, active experts can be kept in GPU memory as a cache for future tokens. If the same experts are activated again in the future, they will be available instantaneously. However, even with caching, most of the inference time is still spent waiting for the next expert to be loaded because MoE offloading cannot effectively overlap expert loading with computation like dense models can.'],\n",
              " 'question_type': 'conditional',\n",
              " 'episode_done': False}"
            ]
          },
          "execution_count": 28,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "ragas_data[0]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sU4QrqiPOF-w"
      },
      "source": [
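        "We first iterate through the questions in this evaluation dataset and put each one to our agent, recording the answer it gives and the contexts it retrieved. To keep the run short, we limit this to the first five questions."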
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 29,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 49,
          "referenced_widgets": [
            "1d16e611b08f421299a126362d76cdc4",
            "6c46d4bb317b4d49ae70e1a0be044442",
            "14fe67648c60413f882f1d62efd2ffa7",
            "40bf35b701e2481b81de737c45090f7e",
            "26c5f9fcb6994125ad1a5077a9196aac",
            "53ad7c44377245fc9a4ce5112718c0c5",
            "060ce0847b474411ba1adc420ea0a6f0",
            "14c981775aae4cbb8b2deef325b7c88e",
            "9bf01738d11648d7bd9e3389d08da081",
            "ab7894a65e0e4ef8b0af1d413769e144",
            "84b6075b7c7748a18c1b526ef92c4c47"
          ]
        },
        "id": "rfK4Y2kGNUUD",
        "outputId": "a5359601-0846-4bcb-ee41-b0021a32efd0"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "1d16e611b08f421299a126362d76cdc4",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "  0%|          | 0/5 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "import pandas as pd\n",
        "from tqdm.auto import tqdm\n",
        "\n",
        "df = pd.DataFrame({\n",
        "    \"question\": [],\n",
        "    \"contexts\": [],\n",
        "    \"answer\": [],\n",
        "    \"ground_truth\": []\n",
        "})\n",
        "\n",
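        "limit = 5  # number of evaluation questions to run through the agent\n",
        "\n",
        "for i, row in tqdm(enumerate(ragas_data), total=limit):\n",
        "    if i >= limit:\n",
        "        break\n",
        "    question = row[\"question\"]\n",
        "    # `ground_truth` is a list (a single item in the row shown above)\n",
        "    ground_truths = row[\"ground_truth\"]\n",
        "    try:\n",
        "        out = chat(question)\n",
        "        answer = out[\"output\"]\n",
        "        if len(out[\"intermediate_steps\"]) != 0:\n",
        "            # the tool output separates retrieved chunks with \"\\n---\\n\", so\n",
        "            # splitting on it gives one string per context\n",
        "            contexts = out[\"intermediate_steps\"][0][1].split(\"\\n---\\n\")\n",
        "        else:\n",
        "            # this is where no intermediate steps are used\n",
        "            contexts = []\n",
        "    except ValueError:\n",
        "        # if the agent run raises a ValueError, record a placeholder answer\n",
        "        answer = \"ERROR\"\n",
        "        contexts = []\n",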
        "    df = pd.concat([df, pd.DataFrame({\n",
        "        \"question\": question,\n",
        "        \"answer\": answer,\n",
        "        \"contexts\": [contexts],\n",
        "        \"ground_truth\": ground_truths\n",
        "    })], ignore_index=True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 30,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 293
        },
        "id": "QrpnyBfuj0uL",
        "outputId": "79f7a5cf-4231-4729-ac9a-2ec965e4ade6"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "summary": "{\n  \"name\": \"df\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"question\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"How does generating tokens affect the inference speed in generative inference?\",\n          \"How does Mixtral compare to Llama 2 70B in code benchmarks?\",\n          \"How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"contexts\",\n      \"properties\": {\n        \"dtype\": \"object\",\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"answer\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"\\nThe paper discusses that the generative inference process in large language models consists of two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once per prompt.\\n\\n2) Generating tokens: This phase generates tokens sequentially, token-by-token and layer-by-layer. This means that each layer needs to be loaded once per generated token. As a result, the inference speed becomes limited by how fast parameters can be fetched from memory. \\n\\nThe key insight is that phase 2 (generating tokens) accounts for the majority of end-to-end latency during generative inference. 
This is because all the model weights need to be loaded for every single generated token, causing inference to be heavily bottlenecked by parameter I/O instead of computation.\\n\\nIn summary, generating more tokens leads to lower utilization of compute (0.1% in the paper's example) and higher inference latency due to repeatedly loading parameters from memory. This highlights why the number of generated tokens is a key factor affecting inference speed.\\n\",\n          \"\\nBased on the information from the arXiv paper, Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference. Specifically:\\n\\n- Mixtral 8x7B outperforms Llama 2 70B on metrics like MMLU, commonsense reasoning, world knowledge, reading comprehension, math and code benchmarks. It has better performance despite having significantly smaller model capacity.\\n\\n- Mixtral uses 13B active parameters per token during inference compared to 70B for Llama 2. With 5x lower active parameters, Mixtral still outperforms Llama 2 70B on most categories.\\n\\nSo in summary, Mixtral compares very favorably to Llama 2 70B on code benchmarks, outperforming it on most metrics while being much more parameter efficient.\\n\",\n          \"\\nThe key differences between the architectures of Mixtral 8x7B and Mistral 7B are:\\n\\n1. Each layer in Mixtral is composed of 8 feedforward blocks (experts), compared to 1 feedforward block per layer in Mistral. \\n\\n2. In Mixtral, a router network selects 2 out of the 8 experts to process each token at every layer. So while each token has access to 47B parameters in total, only 13B parameters are active during inference for each token.\\n\\n3. Mixtral was trained with a larger context size of 32k tokens compared to Mistral.\\n\\nIn summary, Mixtral introduces sparsity in the feedforward layers through its mixture of experts architecture. 
This allows it to scale up to much larger overall parameters while keeping the active parameters during inference manageable.\\n\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"ground_truth\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"Generating tokens affects the inference speed in generative inference by slowing it down. In interactive inference, where tokens are generated autoregressively from left to right, the inference system processes one or few tokens at a time, resulting in a longer waiting time for the next layer's parameters to be loaded. Additionally, the inference speed is limited by how fast parameters can be fetched from system memory. However, by keeping active experts in GPU memory as a cache, the inference time can be sped up if the same experts are activated again in the future. Overall, while caching can reduce the average expert loading time, most of the inference time is still spent waiting for the next expert to be loaded.\",\n          \"Mixtral outperforms Llama 2 70B in code benchmarks.\",\n          \"The architecture of Mixtral 8x7B differs from Mistral 7B in terms of feedforward blocks and active parameters used during inference. Mixtral 8x7B has 8 feedforward blocks (experts) in each layer, while Mistral 7B does not specify the number of feedforward blocks. Additionally, Mixtral 8x7B uses 13B active parameters during inference, while the number of active parameters for Mistral 7B is not mentioned.\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
              "type": "dataframe",
              "variable_name": "df"
            },
            "text/html": [
              "\n",
              "  <div id=\"df-cd9e833a-22f8-4fff-a324-49bc8af70899\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>question</th>\n",
              "      <th>contexts</th>\n",
              "      <th>answer</th>\n",
              "      <th>ground_truth</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>What is the impact of encoding the input promp...</td>\n",
              "      <td>[The generative inference workload consists of...</td>\n",
              "      <td>\\nThe paper discusses that the generative infe...</td>\n",
              "      <td>The encoding of the input prompt has an impact...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>How does generating tokens affect the inferenc...</td>\n",
              "      <td>[The generative inference workload consists of...</td>\n",
              "      <td>\\nThe paper discusses that the generative infe...</td>\n",
              "      <td>Generating tokens affects the inference speed ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>How does the architecture of Mixtral 8x7B diff...</td>\n",
              "      <td>[Abstract\\nWe introduce Mixtral 8x7B, a Sparse...</td>\n",
              "      <td>\\nThe key differences between the architecture...</td>\n",
              "      <td>The architecture of Mixtral 8x7B differs from ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>When is offloading used on the A100 server for...</td>\n",
              "      <td>[# Denis Mazur Moscow Institute of Physics and...</td>\n",
              "      <td>\\nThe paper discusses using offloading strateg...</td>\n",
              "      <td>Offloading is used on the A100 server for acce...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>How does Mixtral compare to Llama 2 70B in cod...</td>\n",
              "      <td>[Table 2: Comparison of Mixtral with Llama. Mi...</td>\n",
              "      <td>\\nBased on the information from the arXiv pape...</td>\n",
              "      <td>Mixtral outperforms Llama 2 70B in code benchm...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-cd9e833a-22f8-4fff-a324-49bc8af70899')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-cd9e833a-22f8-4fff-a324-49bc8af70899 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-cd9e833a-22f8-4fff-a324-49bc8af70899');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-8a0fd4e0-a716-4492-a002-f5c849eb0c65\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-8a0fd4e0-a716-4492-a002-f5c849eb0c65')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-8a0fd4e0-a716-4492-a002-f5c849eb0c65 button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "\n",
              "  <div id=\"id_27406b0b-1450-4657-911f-60a2433be749\">\n",
              "    <style>\n",
              "      .colab-df-generate {\n",
              "        background-color: #E8F0FE;\n",
              "        border: none;\n",
              "        border-radius: 50%;\n",
              "        cursor: pointer;\n",
              "        display: none;\n",
              "        fill: #1967D2;\n",
              "        height: 32px;\n",
              "        padding: 0 0 0 0;\n",
              "        width: 32px;\n",
              "      }\n",
              "\n",
              "      .colab-df-generate:hover {\n",
              "        background-color: #E2EBFA;\n",
              "        box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "        fill: #174EA6;\n",
              "      }\n",
              "\n",
              "      [theme=dark] .colab-df-generate {\n",
              "        background-color: #3B4455;\n",
              "        fill: #D2E3FC;\n",
              "      }\n",
              "\n",
              "      [theme=dark] .colab-df-generate:hover {\n",
              "        background-color: #434B5C;\n",
              "        box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "        filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "        fill: #FFFFFF;\n",
              "      }\n",
              "    </style>\n",
              "    <button class=\"colab-df-generate\" onclick=\"generateWithVariable('df')\"\n",
              "            title=\"Generate code using this dataframe.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M7,19H8.4L18.45,9,17,7.55,7,17.6ZM5,21V16.75L18.45,3.32a2,2,0,0,1,2.83,0l1.4,1.43a1.91,1.91,0,0,1,.58,1.4,1.91,1.91,0,0,1-.58,1.4L9.25,21ZM18.45,9,17,7.55Zm-12,3A5.31,5.31,0,0,0,4.9,8.1,5.31,5.31,0,0,0,1,6.5,5.31,5.31,0,0,0,4.9,4.9,5.31,5.31,0,0,0,6.5,1,5.31,5.31,0,0,0,8.1,4.9,5.31,5.31,0,0,0,12,6.5,5.46,5.46,0,0,0,6.5,12Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "    <script>\n",
              "      (() => {\n",
              "      const buttonEl =\n",
              "        document.querySelector('#id_27406b0b-1450-4657-911f-60a2433be749 button.colab-df-generate');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      buttonEl.onclick = () => {\n",
              "        google.colab.notebook.generateWithVariable('df');\n",
              "      }\n",
              "      })();\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "text/plain": [
              "                                            question  \\\n",
              "0  What is the impact of encoding the input promp...   \n",
              "1  How does generating tokens affect the inferenc...   \n",
              "2  How does the architecture of Mixtral 8x7B diff...   \n",
              "3  When is offloading used on the A100 server for...   \n",
              "4  How does Mixtral compare to Llama 2 70B in cod...   \n",
              "\n",
              "                                            contexts  \\\n",
              "0  [The generative inference workload consists of...   \n",
              "1  [The generative inference workload consists of...   \n",
              "2  [Abstract\\nWe introduce Mixtral 8x7B, a Sparse...   \n",
              "3  [# Denis Mazur Moscow Institute of Physics and...   \n",
              "4  [Table 2: Comparison of Mixtral with Llama. Mi...   \n",
              "\n",
              "                                              answer  \\\n",
              "0  \\nThe paper discusses that the generative infe...   \n",
              "1  \\nThe paper discusses that the generative infe...   \n",
              "2  \\nThe key differences between the architecture...   \n",
              "3  \\nThe paper discusses using offloading strateg...   \n",
              "4  \\nBased on the information from the arXiv pape...   \n",
              "\n",
              "                                        ground_truth  \n",
              "0  The encoding of the input prompt has an impact...  \n",
              "1  Generating tokens affects the inference speed ...  \n",
              "2  The architecture of Mixtral 8x7B differs from ...  \n",
              "3  Offloading is used on the A100 server for acce...  \n",
              "4  Mixtral outperforms Llama 2 70B in code benchm...  "
            ]
          },
          "execution_count": 30,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "df"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 31,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "sGBS3rApYNIt",
        "outputId": "9bb59635-6f91-4ccf-ed5d-46229b94bb26"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Dataset({\n",
              "    features: ['question', 'contexts', 'answer', 'ground_truth'],\n",
              "    num_rows: 5\n",
              "})"
            ]
          },
          "execution_count": 31,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from datasets import Dataset\n",
        "from ragas.metrics import (\n",
        "    faithfulness,\n",
        "    answer_relevancy,\n",
        "    context_precision,\n",
        "    context_relevancy,\n",
        "    context_recall,\n",
        "    answer_similarity,\n",
        "    answer_correctness,\n",
        ")\n",
        "\n",
        "eval_data = Dataset.from_dict(df)\n",
        "eval_data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 32,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 49,
          "referenced_widgets": [
            "d645bb52c56b42ba88b78e3bc321aed6",
            "2d5f9179716a462ab4d1bd84808786e2",
            "17369a3ab6034a1683472f1c34d19b01",
            "bd003f9328f54f7fb52564a36364fe33",
            "46c5e86876fb435c913c95c4edbc2001",
            "434e98f493854e4c920d3bb316a82dc4",
            "63db45584d5d4f439832c40c7a7bb39f",
            "168a6e73679344da8a247cb6b65b09cf",
            "053d4d72f6c64153abbcbe908ca9fa27",
            "37d72aa109a442e2a59051e0328f48fa",
            "619d2d38fa6643cc9bce4aea7284d678"
          ]
        },
        "id": "gckieCTmzfg1",
        "outputId": "4b7d8912-0810-424e-8155-0b2d429c6cb1"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "d645bb52c56b42ba88b78e3bc321aed6",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Evaluating:   0%|          | 0/35 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "from ragas import evaluate\n",
        "\n",
        "result = evaluate(\n",
        "    dataset=eval_data,\n",
        "    metrics=[\n",
        "        faithfulness,\n",
        "        answer_relevancy,\n",
        "        context_precision,\n",
        "        context_relevancy,\n",
        "        context_recall,\n",
        "        answer_similarity,\n",
        "        answer_correctness,\n",
        "    ],\n",
        ")\n",
        "result = result.to_pandas()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eKuXPfTYwTr3"
      },
      "source": [
        "### Retrieval Metrics"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "t-pJwPjqwVrU"
      },
      "source": [
        "Retrieval is the first step in a RAG pipeline, so we will focus on metrics that assess retrieval first. For that we primarily want to focus on `context_recall` and `context_precision` but before diving into these metrics we must understand what it is that they will be measuring.\n",
        "\n",
        "### Actual vs. Predicted\n",
        "\n",
        "When evaluating the performance of retrieval systems we tend to compare the _actual_ (ground truth) to _predicted_ results. We define these as:\n",
        "\n",
        "* **Actual condition** is the true label of every context in the dataset. These are _positive_ ($p$) if the context is relevant to our query or _negative_ ($n$) if the context is _ir_relevant to our query.\n",
        "\n",
        "* **Predicted condition** is the _predicted_ label determined by our retrieval system. If a context is returned it is a predicted _positive_, ie $\\hat{p}$. If a context is not returned it is a predicted _negative_, ie $\\hat{n}$.\n",
        "\n",
        "Given these conditions, we can say the following:\n",
        "\n",
        "* $p\\hat{p}$ is a **true positive**, meaning a relevant result has been returned.\n",
        "* $n\\hat{n}$ is a **true negative**, meaning an irrelevant result was not returned\n",
        "* $n\\hat{p}$ is a **false positive**, meaning an irrelevant result has been returned.\n",
        "* $p\\hat{n}$ is a **false negative**, meaning an relevant result has _not_ been returned.\n",
        "\n",
        "Let's see how these apply to our metrics in RAGAS."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6MMDiXNOwi2n"
      },
      "source": [
        "#### Context Recall"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-fHPVNBkwlMA"
      },
      "source": [
        "Context recall (or just _recall_) is a measure of how many of the relevant records in a dataset have been retrieved. It is calculated as:\n",
        "\n",
        "$$\n",
        "Recall@K = \\frac{p\\hat{p}}{p\\hat{p} + n\\hat{n}} = \\frac{Relevant \\: contexts \\: retrieved}{Total \\: number \\: of \\: relevant \\: contexts}\n",
        "$$\n",
        "\n",
        "RAGAS calculates _Recall@K_ for recall, where the _@K_ represents the number of contexts returned. As the @K value is increased the recall scores will improve (as the capture size of the retrieval step increases). At it's extreme we could set @K equal to the size of the dataset to guarantee perfect recall \u2014 although this negates the point of RAG in the first place.\n",
        "\n",
        "By default, RAGAS uses a _@K_ value of `5`."
      ]
    },
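    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the recall calculation concrete, here is a small worked example. The numbers are hypothetical and are not taken from this notebook's results. Suppose the dataset holds 10 contexts, 4 of which are relevant to a query, and the retriever returns $K = 5$ contexts, 3 of them relevant. That gives $p\\hat{p} = 3$ (relevant and returned), $n\\hat{p} = 2$ (irrelevant but returned), $p\\hat{n} = 1$ (relevant but missed) and $n\\hat{n} = 4$ (irrelevant and not returned), so:\n",
        "\n",
        "$$\n",
        "Recall@5 = \\frac{p\\hat{p}}{p\\hat{p} + p\\hat{n}} = \\frac{3}{3 + 1} = 0.75\n",
        "$$\n",
        "\n",
        "The true negatives ($n\\hat{n}$) never enter the calculation: recall only asks how many of the relevant contexts were found, not how many irrelevant ones were correctly left out."
      ]
    },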
    {
      "cell_type": "code",
      "execution_count": 33,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "id": "ZvbZv1IL3uR4",
        "outputId": "b98e2e94-5f82-4304-aa15-5c6bff921a70"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "summary": "{\n  \"name\": \"result[[\\\"question\\\", \\\"contexts\\\", \\\"answer\\\", \\\"context_recall\\\"]]\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"question\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"How does generating tokens affect the inference speed in generative inference?\",\n          \"How does Mixtral compare to Llama 2 70B in code benchmarks?\",\n          \"How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"contexts\",\n      \"properties\": {\n        \"dtype\": \"object\",\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"answer\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"\\nThe paper discusses that the generative inference process in large language models consists of two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once per prompt.\\n\\n2) Generating tokens: This phase generates tokens sequentially, token-by-token and layer-by-layer. This means that each layer needs to be loaded once per generated token. As a result, the inference speed becomes limited by how fast parameters can be fetched from memory. \\n\\nThe key insight is that phase 2 (generating tokens) accounts for the majority of end-to-end latency during generative inference. 
This is because all the model weights need to be loaded for every single generated token, causing inference to be heavily bottlenecked by parameter I/O instead of computation.\\n\\nIn summary, generating more tokens leads to lower utilization of compute (0.1% in the paper's example) and higher inference latency due to repeatedly loading parameters from memory. This highlights why the number of generated tokens is a key factor affecting inference speed.\\n\",\n          \"\\nBased on the information from the arXiv paper, Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference. Specifically:\\n\\n- Mixtral 8x7B outperforms Llama 2 70B on metrics like MMLU, commonsense reasoning, world knowledge, reading comprehension, math and code benchmarks. It has better performance despite having significantly smaller model capacity.\\n\\n- Mixtral uses 13B active parameters per token during inference compared to 70B for Llama 2. With 5x lower active parameters, Mixtral still outperforms Llama 2 70B on most categories.\\n\\nSo in summary, Mixtral compares very favorably to Llama 2 70B on code benchmarks, outperforming it on most metrics while being much more parameter efficient.\\n\",\n          \"\\nThe key differences between the architectures of Mixtral 8x7B and Mistral 7B are:\\n\\n1. Each layer in Mixtral is composed of 8 feedforward blocks (experts), compared to 1 feedforward block per layer in Mistral. \\n\\n2. In Mixtral, a router network selects 2 out of the 8 experts to process each token at every layer. So while each token has access to 47B parameters in total, only 13B parameters are active during inference for each token.\\n\\n3. Mixtral was trained with a larger context size of 32k tokens compared to Mistral.\\n\\nIn summary, Mixtral introduces sparsity in the feedforward layers through its mixture of experts architecture. 
This allows it to scale up to much larger overall parameters while keeping the active parameters during inference manageable.\\n\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"context_recall\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.1788854381999832,\n        \"min\": 0.6,\n        \"max\": 1.0,\n        \"num_unique_values\": 2,\n        \"samples\": [\n          0.6,\n          1.0\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
              "type": "dataframe"
            },
            "text/html": [
              "\n",
              "  <div id=\"df-9aa9b111-1498-4412-a8f9-a67acd6e4033\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>question</th>\n",
              "      <th>contexts</th>\n",
              "      <th>answer</th>\n",
              "      <th>context_recall</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>What is the impact of encoding the input prompt on inference speed in generative inference?</td>\n",
              "      <td>[The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.\\nBelow, we look for ...</td>\n",
              "      <td>\\nThe paper discusses that the generative inference workload has two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once for the entire prompt.\\n\\n2) Generating tokens conditioned on the prompt: This phase runs sequentially token-by-token and layer-by-layer. In this phase, each layer needs to be loaded once per generated token. So inference speed is limited by how fast parameters can be fetched from memory.\\n\\nThe key difference in terms of impact on inference speed is that the prompt encoding phase works well with existi...</td>\n",
              "      <td>1.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>How does generating tokens affect the inference speed in generative inference?</td>\n",
              "      <td>[The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.\\nBelow, we look for ...</td>\n",
              "      <td>\\nThe paper discusses that the generative inference process in large language models consists of two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once per prompt.\\n\\n2) Generating tokens: This phase generates tokens sequentially, token-by-token and layer-by-layer. This means that each layer needs to be loaded once per generated token. As a result, the inference speed becomes limited by how fast parameters can be fetched from memory. \\n\\nThe key insight is that phase 2 (generating tokens) accounts for the majority of end...</td>\n",
              "      <td>0.6</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?</td>\n",
              "      <td>[Abstract\\nWe introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchm...</td>\n",
              "      <td>\\nThe key differences between the architectures of Mixtral 8x7B and Mistral 7B are:\\n\\n1. Each layer in Mixtral is composed of 8 feedforward blocks (experts), compared to 1 feedforward block per layer in Mistral. \\n\\n2. In Mixtral, a router network selects 2 out of the 8 experts to process each token at every layer. So while each token has access to 47B parameters in total, only 13B parameters are active during inference for each token.\\n\\n3. Mixtral was trained with a larger context size of 32k tokens compared to Mistral.\\n\\nIn summary, Mixtral introduces sparsity in the feedforward layers through its mixture of experts architecture. This allows it to scale up to much larger overall par...</td>\n",
              "      <td>1.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>When is offloading used on the A100 server for accelerating MoE-based language models?</td>\n",
              "      <td>[# Denis Mazur Moscow Institute of Physics and Technology Yandex Researchcore denismazur8@gmail.com\\n# Abstract\\nWith the widespread adoption of Large Language Models (LLMs), many deep learning practitioners are looking for strategies of running these models more efficiently. One such strategy is to use sparse Mixture-of-Experts (MoE) \u00e2\u0080\u0094 a type of model architectures where only a fraction of model layers are active for any given input. This property allows MoE-based language models to generate tokens faster than their \u00e2\u0080\u009cdense\u00e2\u0080\u009d counterparts, but it also increases model size due to having multiple \u00e2\u0080\u009cexperts\u00e2\u0080\u009d. Unfortunately, this makes state-of-the-art MoE language models difficult t...</td>\n",
              "      <td>\\nThe paper discusses using offloading strategies to run large MoE (Mixture of Experts) language models on consumer hardware with limited GPU memory. Some key points:\\n\\n- MoE models have sparse activations, where only a fraction of experts are active on each input. This allows faster inference than dense models.\\n\\n- But MoE models are very large, making them difficult to run on consumer GPUs. For example, Falcon-180B needs ~360GB memory. \\n\\n- The paper proposes MoE-specific offloading that takes advantage of:\\n  - Expert reuse between tokens (using an LRU cache)\\n  - Early layer states predicting later active experts. This allows prefetching experts.\\n  \\n- They use these strategies t...</td>\n",
              "      <td>1.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>How does Mixtral compare to Llama 2 70B in code benchmarks?</td>\n",
              "      <td>[Table 2: Comparison of Mixtral with Llama. Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference.\\n70 Mixtral 8x7B. \u00e2\u0080\u0098Mixtral 8x7B Mixtral 8x7B 355 =o = Es &amp; E60! Mistral 78 % 2681 Mistral 78 3 3 s0 5 = A % 66 50 g 4 45 64 78 138 348708 78 138 348708 78 138 348 70B S66 Mixtral 8x7B 50 Mixtral 8x7B 5 = 564 340 g al Mistral 78 ee Mistral 78 3 5 \u00c2\u00a7 30 5 eo \u00e2\u0080\u0094= Mistral \u00c2\u00b0 20 \u00e2\u0080\u0094e LlaMA2 78 (138 348 70B 7B (138 348 708 7B \u00c2\u00ab13B 34B 708 Active Params Active Params Active Params\\nFigure 3: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension, math and code for Mistral (7B/8x...</td>\n",
              "      <td>\\nBased on the information from the arXiv paper, Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference. Specifically:\\n\\n- Mixtral 8x7B outperforms Llama 2 70B on metrics like MMLU, commonsense reasoning, world knowledge, reading comprehension, math and code benchmarks. It has better performance despite having significantly smaller model capacity.\\n\\n- Mixtral uses 13B active parameters per token during inference compared to 70B for Llama 2. With 5x lower active parameters, Mixtral still outperforms Llama 2 70B on most categories.\\n\\nSo in summary, Mixtral compares very favorably to Llama 2 70B on c...</td>\n",
              "      <td>1.0</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-9aa9b111-1498-4412-a8f9-a67acd6e4033')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-9aa9b111-1498-4412-a8f9-a67acd6e4033 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-9aa9b111-1498-4412-a8f9-a67acd6e4033');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-a8d8d337-7961-426b-9904-ab30073056b0\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-a8d8d337-7961-426b-9904-ab30073056b0')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-a8d8d337-7961-426b-9904-ab30073056b0 button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "text/plain": [
              "                                                                                                                                       question  \\\n",
              "0                                                   What is the impact of encoding the input prompt on inference speed in generative inference?   \n",
              "1                                                                How does generating tokens affect the inference speed in generative inference?   \n",
              "2  How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?   \n",
              "3                                                        When is offloading used on the A100 server for accelerating MoE-based language models?   \n",
              "4                                                                                   How does Mixtral compare to Llama 2 70B in code benchmarks?   \n",
              "\n",
              "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      contexts  \\\n",
              "0  [The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.\\nBelow, we look for ...   \n",
              "1  [The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.\\nBelow, we look for ...   \n",
              "2  [Abstract\\nWe introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchm...   \n",
              "3  [# Denis Mazur Moscow Institute of Physics and Technology Yandex Researchcore denismazur8@gmail.com\\n# Abstract\\nWith the widespread adoption of Large Language Models (LLMs), many deep learning practitioners are looking for strategies of running these models more efficiently. One such strategy is to use sparse Mixture-of-Experts (MoE) \u00e2\u0080\u0094 a type of model architectures where only a fraction of model layers are active for any given input. This property allows MoE-based language models to generate tokens faster than their \u00e2\u0080\u009cdense\u00e2\u0080\u009d counterparts, but it also increases model size due to having multiple \u00e2\u0080\u009cexperts\u00e2\u0080\u009d. Unfortunately, this makes state-of-the-art MoE language models difficult t...   \n",
              "4  [Table 2: Comparison of Mixtral with Llama. Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference.\\n70 Mixtral 8x7B. \u00e2\u0080\u0098Mixtral 8x7B Mixtral 8x7B 355 =o = Es & E60! Mistral 78 % 2681 Mistral 78 3 3 s0 5 = A % 66 50 g 4 45 64 78 138 348708 78 138 348708 78 138 348 70B S66 Mixtral 8x7B 50 Mixtral 8x7B 5 = 564 340 g al Mistral 78 ee Mistral 78 3 5 \u00c2\u00a7 30 5 eo \u00e2\u0080\u0094= Mistral \u00c2\u00b0 20 \u00e2\u0080\u0094e LlaMA2 78 (138 348 70B 7B (138 348 708 7B \u00c2\u00ab13B 34B 708 Active Params Active Params Active Params\\nFigure 3: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension, math and code for Mistral (7B/8x...   \n",
              "\n",
              "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        answer  \\\n",
              "0  \\nThe paper discusses that the generative inference workload has two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once for the entire prompt.\\n\\n2) Generating tokens conditioned on the prompt: This phase runs sequentially token-by-token and layer-by-layer. In this phase, each layer needs to be loaded once per generated token. So inference speed is limited by how fast parameters can be fetched from memory.\\n\\nThe key difference in terms of impact on inference speed is that the prompt encoding phase works well with existi...   \n",
              "1  \\nThe paper discusses that the generative inference process in large language models consists of two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once per prompt.\\n\\n2) Generating tokens: This phase generates tokens sequentially, token-by-token and layer-by-layer. This means that each layer needs to be loaded once per generated token. As a result, the inference speed becomes limited by how fast parameters can be fetched from memory. \\n\\nThe key insight is that phase 2 (generating tokens) accounts for the majority of end...   \n",
              "2  \\nThe key differences between the architectures of Mixtral 8x7B and Mistral 7B are:\\n\\n1. Each layer in Mixtral is composed of 8 feedforward blocks (experts), compared to 1 feedforward block per layer in Mistral. \\n\\n2. In Mixtral, a router network selects 2 out of the 8 experts to process each token at every layer. So while each token has access to 47B parameters in total, only 13B parameters are active during inference for each token.\\n\\n3. Mixtral was trained with a larger context size of 32k tokens compared to Mistral.\\n\\nIn summary, Mixtral introduces sparsity in the feedforward layers through its mixture of experts architecture. This allows it to scale up to much larger overall par...   \n",
              "3  \\nThe paper discusses using offloading strategies to run large MoE (Mixture of Experts) language models on consumer hardware with limited GPU memory. Some key points:\\n\\n- MoE models have sparse activations, where only a fraction of experts are active on each input. This allows faster inference than dense models.\\n\\n- But MoE models are very large, making them difficult to run on consumer GPUs. For example, Falcon-180B needs ~360GB memory. \\n\\n- The paper proposes MoE-specific offloading that takes advantage of:\\n  - Expert reuse between tokens (using an LRU cache)\\n  - Early layer states predicting later active experts. This allows prefetching experts.\\n  \\n- They use these strategies t...   \n",
              "4  \\nBased on the information from the arXiv paper, Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference. Specifically:\\n\\n- Mixtral 8x7B outperforms Llama 2 70B on metrics like MMLU, commonsense reasoning, world knowledge, reading comprehension, math and code benchmarks. It has better performance despite having significantly smaller model capacity.\\n\\n- Mixtral uses 13B active parameters per token during inference compared to 70B for Llama 2. With 5x lower active parameters, Mixtral still outperforms Llama 2 70B on most categories.\\n\\nSo in summary, Mixtral compares very favorably to Llama 2 70B on c...   \n",
              "\n",
              "   context_recall  \n",
              "0             1.0  \n",
              "1             0.6  \n",
              "2             1.0  \n",
              "3             1.0  \n",
              "4             1.0  "
            ]
          },
          "execution_count": 33,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "pd.set_option(\"display.max_colwidth\", 700)\n",
        "result[[\"question\", \"contexts\", \"answer\", \"context_recall\"]]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MA3mHXDO4ALO"
      },
      "source": [
        "Here we can see that every result except the second returned all relevant contexts. The second result scored `0.6`, meaning that 3/5 (60%) of the relevant contexts were returned.\n",
        "\n",
        "All other results scored `1.0` (100%), meaning all relevant contexts were retrieved.\n",
        "\n",
        "Recall is a useful metric, but it is easily fooled by simply returning more records, i.e. increasing the _@K_ value. Because of that, it is typically paired with _precision_."
      ]
    },
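    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the last point concrete, here is a small worked example with made-up numbers (an illustrative assumption, not results from this notebook). Suppose a question has 5 relevant contexts in the index. Retrieving with $K = 5$ returns 3 of them, while retrieving with $K = 20$ returns all 5 alongside 15 irrelevant contexts:\n",
        "\n",
        "$$\n",
        "Recall@5 = \\frac{3}{5} = 0.6 \\qquad Recall@20 = \\frac{5}{5} = 1.0\n",
        "$$\n",
        "\n",
        "Recall improves from `0.6` to `1.0` even though three quarters of the contexts returned at $K = 20$ are irrelevant, which recall alone never penalizes."
      ]
    },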
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xmItVH9l_BeD"
      },
      "source": [
        "### Context Precision"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Fnr0tS-7_Dho"
      },
      "source": [
        "Context precision (or just _precision_) is another popular retrieval metric. Recall and precision are typically reported together when evaluating retrieval systems.\n",
        "\n",
        "As with recall, the actual metric here is called _Precision@K_, where _@K_ represents the number of contexts returned. Unlike recall, precision compares the number of relevant results returned against the total number of results returned, relevant or not. That total is equal to our chosen _@K_ value. In the formula below, $p\\hat{p}$ counts relevant contexts that were retrieved and $n\\hat{p}$ counts irrelevant contexts that were retrieved.\n",
        "\n",
        "$$\n",
        "Precision@K = \\frac{p\\hat{p}}{p\\hat{p} + n\\hat{p}} = \\frac{Relevant \\: contexts \\: retrieved}{Total \\: number \\: of \\: contexts \\: retrieved}\n",
        "$$"
      ]
    },
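    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a rough worked example with hypothetical numbers (an assumption for illustration, not output from this notebook), suppose we retrieve $K = 5$ contexts and 3 of them are relevant. If we instead retrieve $K = 20$ contexts and 5 of them are relevant, precision drops:\n",
        "\n",
        "$$\n",
        "Precision@5 = \\frac{3}{5} = 0.6 \\qquad Precision@20 = \\frac{5}{20} = 0.25\n",
        "$$\n",
        "\n",
        "Every extra record counts towards the denominator, so precision falls when the additional records are mostly irrelevant. This is what makes precision a useful counterweight to recall."
      ]
    },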
    {
      "cell_type": "code",
      "execution_count": 34,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "id": "M6ZZgWPKWpw6",
        "outputId": "5e1163f5-d10a-46d6-a63d-3730175b6f52"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "summary": "{\n  \"name\": \"result[[\\\"question\\\", \\\"contexts\\\", \\\"answer\\\", \\\"context_precision\\\"]]\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"question\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"How does generating tokens affect the inference speed in generative inference?\",\n          \"How does Mixtral compare to Llama 2 70B in code benchmarks?\",\n          \"How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"contexts\",\n      \"properties\": {\n        \"dtype\": \"object\",\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"answer\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"\\nThe paper discusses that the generative inference process in large language models consists of two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once per prompt.\\n\\n2) Generating tokens: This phase generates tokens sequentially, token-by-token and layer-by-layer. This means that each layer needs to be loaded once per generated token. As a result, the inference speed becomes limited by how fast parameters can be fetched from memory. \\n\\nThe key insight is that phase 2 (generating tokens) accounts for the majority of end-to-end latency during generative inference. 
This is because all the model weights need to be loaded for every single generated token, causing inference to be heavily bottlenecked by parameter I/O instead of computation.\\n\\nIn summary, generating more tokens leads to lower utilization of compute (0.1% in the paper's example) and higher inference latency due to repeatedly loading parameters from memory. This highlights why the number of generated tokens is a key factor affecting inference speed.\\n\",\n          \"\\nBased on the information from the arXiv paper, Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference. Specifically:\\n\\n- Mixtral 8x7B outperforms Llama 2 70B on metrics like MMLU, commonsense reasoning, world knowledge, reading comprehension, math and code benchmarks. It has better performance despite having significantly smaller model capacity.\\n\\n- Mixtral uses 13B active parameters per token during inference compared to 70B for Llama 2. With 5x lower active parameters, Mixtral still outperforms Llama 2 70B on most categories.\\n\\nSo in summary, Mixtral compares very favorably to Llama 2 70B on code benchmarks, outperforming it on most metrics while being much more parameter efficient.\\n\",\n          \"\\nThe key differences between the architectures of Mixtral 8x7B and Mistral 7B are:\\n\\n1. Each layer in Mixtral is composed of 8 feedforward blocks (experts), compared to 1 feedforward block per layer in Mistral. \\n\\n2. In Mixtral, a router network selects 2 out of the 8 experts to process each token at every layer. So while each token has access to 47B parameters in total, only 13B parameters are active during inference for each token.\\n\\n3. Mixtral was trained with a larger context size of 32k tokens compared to Mistral.\\n\\nIn summary, Mixtral introduces sparsity in the feedforward layers through its mixture of experts architecture. 
This allows it to scale up to much larger overall parameters while keeping the active parameters during inference manageable.\\n\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"context_precision\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.059628479403969606,\n        \"min\": 0.8666666666377778,\n        \"max\": 0.99999999998,\n        \"num_unique_values\": 2,\n        \"samples\": [\n          0.99999999998,\n          0.8666666666377778\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
              "type": "dataframe"
            },
            "text/html": [
              "\n",
              "  <div id=\"df-ee5204df-3c2b-4bf9-8885-4c6b05554c43\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>question</th>\n",
              "      <th>contexts</th>\n",
              "      <th>answer</th>\n",
              "      <th>context_precision</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>What is the impact of encoding the input prompt on inference speed in generative inference?</td>\n",
              "      <td>[The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.\\nBelow, we look for ...</td>\n",
              "      <td>\\nThe paper discusses that the generative inference workload has two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once for the entire prompt.\\n\\n2) Generating tokens conditioned on the prompt: This phase runs sequentially token-by-token and layer-by-layer. In this phase, each layer needs to be loaded once per generated token. So inference speed is limited by how fast parameters can be fetched from memory.\\n\\nThe key difference in terms of impact on inference speed is that the prompt encoding phase works well with existi...</td>\n",
              "      <td>0.866667</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>How does generating tokens affect the inference speed in generative inference?</td>\n",
              "      <td>[The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.\\nBelow, we look for ...</td>\n",
              "      <td>\\nThe paper discusses that the generative inference process in large language models consists of two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once per prompt.\\n\\n2) Generating tokens: This phase generates tokens sequentially, token-by-token and layer-by-layer. This means that each layer needs to be loaded once per generated token. As a result, the inference speed becomes limited by how fast parameters can be fetched from memory. \\n\\nThe key insight is that phase 2 (generating tokens) accounts for the majority of end...</td>\n",
              "      <td>1.000000</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?</td>\n",
              "      <td>[Abstract\\nWe introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchm...</td>\n",
              "      <td>\\nThe key differences between the architectures of Mixtral 8x7B and Mistral 7B are:\\n\\n1. Each layer in Mixtral is composed of 8 feedforward blocks (experts), compared to 1 feedforward block per layer in Mistral. \\n\\n2. In Mixtral, a router network selects 2 out of the 8 experts to process each token at every layer. So while each token has access to 47B parameters in total, only 13B parameters are active during inference for each token.\\n\\n3. Mixtral was trained with a larger context size of 32k tokens compared to Mistral.\\n\\nIn summary, Mixtral introduces sparsity in the feedforward layers through its mixture of experts architecture. This allows it to scale up to much larger overall par...</td>\n",
              "      <td>1.000000</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>When is offloading used on the A100 server for accelerating MoE-based language models?</td>\n",
              "      <td>[# Denis Mazur Moscow Institute of Physics and Technology Yandex Researchcore denismazur8@gmail.com\\n# Abstract\\nWith the widespread adoption of Large Language Models (LLMs), many deep learning practitioners are looking for strategies of running these models more efficiently. One such strategy is to use sparse Mixture-of-Experts (MoE) \u00e2\u0080\u0094 a type of model architectures where only a fraction of model layers are active for any given input. This property allows MoE-based language models to generate tokens faster than their \u00e2\u0080\u009cdense\u00e2\u0080\u009d counterparts, but it also increases model size due to having multiple \u00e2\u0080\u009cexperts\u00e2\u0080\u009d. Unfortunately, this makes state-of-the-art MoE language models difficult t...</td>\n",
              "      <td>\\nThe paper discusses using offloading strategies to run large MoE (Mixture of Experts) language models on consumer hardware with limited GPU memory. Some key points:\\n\\n- MoE models have sparse activations, where only a fraction of experts are active on each input. This allows faster inference than dense models.\\n\\n- But MoE models are very large, making them difficult to run on consumer GPUs. For example, Falcon-180B needs ~360GB memory. \\n\\n- The paper proposes MoE-specific offloading that takes advantage of:\\n  - Expert reuse between tokens (using an LRU cache)\\n  - Early layer states predicting later active experts. This allows prefetching experts.\\n  \\n- They use these strategies t...</td>\n",
              "      <td>1.000000</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>How does Mixtral compare to Llama 2 70B in code benchmarks?</td>\n",
              "      <td>[Table 2: Comparison of Mixtral with Llama. Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference.\\n70 Mixtral 8x7B. \u00e2\u0080\u0098Mixtral 8x7B Mixtral 8x7B 355 =o = Es &amp; E60! Mistral 78 % 2681 Mistral 78 3 3 s0 5 = A % 66 50 g 4 45 64 78 138 348708 78 138 348708 78 138 348 70B S66 Mixtral 8x7B 50 Mixtral 8x7B 5 = 564 340 g al Mistral 78 ee Mistral 78 3 5 \u00c2\u00a7 30 5 eo \u00e2\u0080\u0094= Mistral \u00c2\u00b0 20 \u00e2\u0080\u0094e LlaMA2 78 (138 348 70B 7B (138 348 708 7B \u00c2\u00ab13B 34B 708 Active Params Active Params Active Params\\nFigure 3: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension, math and code for Mistral (7B/8x...</td>\n",
              "      <td>\\nBased on the information from the arXiv paper, Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference. Specifically:\\n\\n- Mixtral 8x7B outperforms Llama 2 70B on metrics like MMLU, commonsense reasoning, world knowledge, reading comprehension, math and code benchmarks. It has better performance despite having significantly smaller model capacity.\\n\\n- Mixtral uses 13B active parameters per token during inference compared to 70B for Llama 2. With 5x lower active parameters, Mixtral still outperforms Llama 2 70B on most categories.\\n\\nSo in summary, Mixtral compares very favorably to Llama 2 70B on c...</td>\n",
              "      <td>1.000000</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-ee5204df-3c2b-4bf9-8885-4c6b05554c43')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-ee5204df-3c2b-4bf9-8885-4c6b05554c43 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-ee5204df-3c2b-4bf9-8885-4c6b05554c43');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-df0a454b-c31c-483b-bc06-3caf60f9da2c\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-df0a454b-c31c-483b-bc06-3caf60f9da2c')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-df0a454b-c31c-483b-bc06-3caf60f9da2c button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "text/plain": [
              "                                                                                                                                       question  \\\n",
              "0                                                   What is the impact of encoding the input prompt on inference speed in generative inference?   \n",
              "1                                                                How does generating tokens affect the inference speed in generative inference?   \n",
              "2  How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?   \n",
              "3                                                        When is offloading used on the A100 server for accelerating MoE-based language models?   \n",
              "4                                                                                   How does Mixtral compare to Llama 2 70B in code benchmarks?   \n",
              "\n",
              "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      contexts  \\\n",
              "0  [The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.\\nBelow, we look for ...   \n",
              "1  [The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.\\nBelow, we look for ...   \n",
              "2  [Abstract\\nWe introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchm...   \n",
              "3  [# Denis Mazur Moscow Institute of Physics and Technology Yandex Researchcore denismazur8@gmail.com\\n# Abstract\\nWith the widespread adoption of Large Language Models (LLMs), many deep learning practitioners are looking for strategies of running these models more efficiently. One such strategy is to use sparse Mixture-of-Experts (MoE) \u00e2\u0080\u0094 a type of model architectures where only a fraction of model layers are active for any given input. This property allows MoE-based language models to generate tokens faster than their \u00e2\u0080\u009cdense\u00e2\u0080\u009d counterparts, but it also increases model size due to having multiple \u00e2\u0080\u009cexperts\u00e2\u0080\u009d. Unfortunately, this makes state-of-the-art MoE language models difficult t...   \n",
              "4  [Table 2: Comparison of Mixtral with Llama. Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference.\\n70 Mixtral 8x7B. \u00e2\u0080\u0098Mixtral 8x7B Mixtral 8x7B 355 =o = Es & E60! Mistral 78 % 2681 Mistral 78 3 3 s0 5 = A % 66 50 g 4 45 64 78 138 348708 78 138 348708 78 138 348 70B S66 Mixtral 8x7B 50 Mixtral 8x7B 5 = 564 340 g al Mistral 78 ee Mistral 78 3 5 \u00c2\u00a7 30 5 eo \u00e2\u0080\u0094= Mistral \u00c2\u00b0 20 \u00e2\u0080\u0094e LlaMA2 78 (138 348 70B 7B (138 348 708 7B \u00c2\u00ab13B 34B 708 Active Params Active Params Active Params\\nFigure 3: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension, math and code for Mistral (7B/8x...   \n",
              "\n",
              "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        answer  \\\n",
              "0  \\nThe paper discusses that the generative inference workload has two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once for the entire prompt.\\n\\n2) Generating tokens conditioned on the prompt: This phase runs sequentially token-by-token and layer-by-layer. In this phase, each layer needs to be loaded once per generated token. So inference speed is limited by how fast parameters can be fetched from memory.\\n\\nThe key difference in terms of impact on inference speed is that the prompt encoding phase works well with existi...   \n",
              "1  \\nThe paper discusses that the generative inference process in large language models consists of two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once per prompt.\\n\\n2) Generating tokens: This phase generates tokens sequentially, token-by-token and layer-by-layer. This means that each layer needs to be loaded once per generated token. As a result, the inference speed becomes limited by how fast parameters can be fetched from memory. \\n\\nThe key insight is that phase 2 (generating tokens) accounts for the majority of end...   \n",
              "2  \\nThe key differences between the architectures of Mixtral 8x7B and Mistral 7B are:\\n\\n1. Each layer in Mixtral is composed of 8 feedforward blocks (experts), compared to 1 feedforward block per layer in Mistral. \\n\\n2. In Mixtral, a router network selects 2 out of the 8 experts to process each token at every layer. So while each token has access to 47B parameters in total, only 13B parameters are active during inference for each token.\\n\\n3. Mixtral was trained with a larger context size of 32k tokens compared to Mistral.\\n\\nIn summary, Mixtral introduces sparsity in the feedforward layers through its mixture of experts architecture. This allows it to scale up to much larger overall par...   \n",
              "3  \\nThe paper discusses using offloading strategies to run large MoE (Mixture of Experts) language models on consumer hardware with limited GPU memory. Some key points:\\n\\n- MoE models have sparse activations, where only a fraction of experts are active on each input. This allows faster inference than dense models.\\n\\n- But MoE models are very large, making them difficult to run on consumer GPUs. For example, Falcon-180B needs ~360GB memory. \\n\\n- The paper proposes MoE-specific offloading that takes advantage of:\\n  - Expert reuse between tokens (using an LRU cache)\\n  - Early layer states predicting later active experts. This allows prefetching experts.\\n  \\n- They use these strategies t...   \n",
              "4  \\nBased on the information from the arXiv paper, Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference. Specifically:\\n\\n- Mixtral 8x7B outperforms Llama 2 70B on metrics like MMLU, commonsense reasoning, world knowledge, reading comprehension, math and code benchmarks. It has better performance despite having significantly smaller model capacity.\\n\\n- Mixtral uses 13B active parameters per token during inference compared to 70B for Llama 2. With 5x lower active parameters, Mixtral still outperforms Llama 2 70B on most categories.\\n\\nSo in summary, Mixtral compares very favorably to Llama 2 70B on c...   \n",
              "\n",
              "   context_precision  \n",
              "0           0.866667  \n",
              "1           1.000000  \n",
              "2           1.000000  \n",
              "3           1.000000  \n",
              "4           1.000000  "
            ]
          },
          "execution_count": 34,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "pd.set_option(\"display.max_colwidth\", 700)\n",
        "result[[\"question\", \"contexts\", \"answer\", \"context_precision\"]]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vD6NoHDZWyVx"
      },
      "source": [
        "Our precision@K scores can match our recall scores (this can happen when there are _5_ relevant contexts for each query and we set _@K = 5_). In the table above, every query scored a context precision of _1.0_ except the first, which scored roughly _0.87_ because not all of its returned contexts were relevant."
      ]
    },
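    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a rough worked example of how a score such as _0.866667_ can arise: as we understand it, RAGAS computes context precision as a rank-weighted average of precision@k over the positions that hold a relevant context, rather than as a plain fraction of relevant contexts. Here $v_k$ is _1_ if the context at rank $k$ is judged relevant and _0_ otherwise. The relevance pattern below (ranks 1, 2 and 5 relevant) is an assumption chosen for illustration; the per-context verdicts are not shown in the output above.\n",
        "\n",
        "$$\n",
        "Context \\: Precision@K = \\frac{\\sum_{k=1}^{K} (Precision@k \\times v_k)}{Number \\: of \\: relevant \\: contexts \\: in \\: top \\: K}\n",
        "$$\n",
        "\n",
        "$$\n",
        "\\frac{(1/1) \\times 1 + (2/2) \\times 1 + (2/3) \\times 0 + (2/4) \\times 0 + (3/5) \\times 1}{3} = \\frac{2.6}{3} \\approx 0.8667\n",
        "$$\n",
        "\n",
        "Under this assumption, a relevant context ranked below irrelevant ones lowers the score, so three relevant contexts out of five can still score well above _0.6_."
      ]
    },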
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NI5NgDpWXKuY"
      },
      "source": [
        "## Generation Metrics"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kTUWxra6wK_A"
      },
      "source": [
        "### Faithfulness"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pBqu0ZBGgAwl"
      },
      "source": [
        "The _faithfulness_ metric measures (from _0_ to _1_) the factual consistency of an answer when compared to the retrieved context. A score of _1_ means all claims in the answer can be found in the context. A score of _0_ indicates that _no_ claims in the answer are found in the context.\n",
        "\n",
        "We calculate faithfulness like so:\n",
        "\n",
        "$$\n",
        "Faithfulness = \\frac{Number \\: of \\: claims \\: in \\: answer \\: also \\: found \\: in \\: context}{Number \\: of \\: claims \\: in \\: answer}\n",
        "$$"
      ]
    },
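    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the ratio concrete, here is a small worked example with made-up numbers (they are not taken from our results). Suppose an answer is broken down into _4_ claims and _3_ of them can be found in the retrieved context:\n",
        "\n",
        "$$\n",
        "Faithfulness = \\frac{3}{4} = 0.75\n",
        "$$\n",
        "\n",
        "A score of _1_, by the same arithmetic, would require all _4_ claims to be supported by the context."
      ]
    },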
    {
      "cell_type": "code",
      "execution_count": 35,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "id": "0rGhG62Hv2bm",
        "outputId": "ae9bf65e-97ca-4fde-a29e-132c709cb413"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "summary": "{\n  \"name\": \"result[[\\\"question\\\", \\\"contexts\\\", \\\"answer\\\", \\\"faithfulness\\\"]]\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"question\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"How does generating tokens affect the inference speed in generative inference?\",\n          \"How does Mixtral compare to Llama 2 70B in code benchmarks?\",\n          \"How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"contexts\",\n      \"properties\": {\n        \"dtype\": \"object\",\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"answer\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"\\nThe paper discusses that the generative inference process in large language models consists of two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once per prompt.\\n\\n2) Generating tokens: This phase generates tokens sequentially, token-by-token and layer-by-layer. This means that each layer needs to be loaded once per generated token. As a result, the inference speed becomes limited by how fast parameters can be fetched from memory. \\n\\nThe key insight is that phase 2 (generating tokens) accounts for the majority of end-to-end latency during generative inference. 
This is because all the model weights need to be loaded for every single generated token, causing inference to be heavily bottlenecked by parameter I/O instead of computation.\\n\\nIn summary, generating more tokens leads to lower utilization of compute (0.1% in the paper's example) and higher inference latency due to repeatedly loading parameters from memory. This highlights why the number of generated tokens is a key factor affecting inference speed.\\n\",\n          \"\\nBased on the information from the arXiv paper, Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference. Specifically:\\n\\n- Mixtral 8x7B outperforms Llama 2 70B on metrics like MMLU, commonsense reasoning, world knowledge, reading comprehension, math and code benchmarks. It has better performance despite having significantly smaller model capacity.\\n\\n- Mixtral uses 13B active parameters per token during inference compared to 70B for Llama 2. With 5x lower active parameters, Mixtral still outperforms Llama 2 70B on most categories.\\n\\nSo in summary, Mixtral compares very favorably to Llama 2 70B on code benchmarks, outperforming it on most metrics while being much more parameter efficient.\\n\",\n          \"\\nThe key differences between the architectures of Mixtral 8x7B and Mistral 7B are:\\n\\n1. Each layer in Mixtral is composed of 8 feedforward blocks (experts), compared to 1 feedforward block per layer in Mistral. \\n\\n2. In Mixtral, a router network selects 2 out of the 8 experts to process each token at every layer. So while each token has access to 47B parameters in total, only 13B parameters are active during inference for each token.\\n\\n3. Mixtral was trained with a larger context size of 32k tokens compared to Mistral.\\n\\nIn summary, Mixtral introduces sparsity in the feedforward layers through its mixture of experts architecture. 
This allows it to scale up to much larger overall parameters while keeping the active parameters during inference manageable.\\n\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"faithfulness\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.0,\n        \"min\": 1.0,\n        \"max\": 1.0,\n        \"num_unique_values\": 1,\n        \"samples\": [\n          1.0\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
              "type": "dataframe"
            },
            "text/html": [
              "\n",
              "  <div id=\"df-3c100900-eb13-4885-b2a6-e110d4174f0a\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>question</th>\n",
              "      <th>contexts</th>\n",
              "      <th>answer</th>\n",
              "      <th>faithfulness</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>What is the impact of encoding the input prompt on inference speed in generative inference?</td>\n",
              "      <td>[The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.\\nBelow, we look for patterns in how the MoE model loads its experts and propose ways to exploit these patterns to speed up inference time.\\n4To learn more about these methods, please refer to surveys such as Gholami et al. (2021); Liang et al. (2021) 5As opposed to running a processing a large batch of texts over many ...</td>\n",
              "      <td>\\nThe paper discusses that the generative inference workload has two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once for the entire prompt.\\n\\n2) Generating tokens conditioned on the prompt: This phase runs sequentially token-by-token and layer-by-layer. In this phase, each layer needs to be loaded once per generated token. So inference speed is limited by how fast parameters can be fetched from memory.\\n\\nThe key difference in terms of impact on inference speed is that the prompt encoding phase works well with existing parallelization techniques, while the sequential token generation phase is harder to optimize and can become a bottleneck.\\n\\nSo in summary, the prompt encoding phase has less of an impact on slowing down inference, while the sequential generation phase can significantly limit inference speed if ...</td>\n",
              "      <td>1.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>How does generating tokens affect the inference speed in generative inference?</td>\n",
              "      <td>[The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.\\nBelow, we look for patterns in how the MoE model loads its experts and propose ways to exploit these patterns to speed up inference time.\\n4To learn more about these methods, please refer to surveys such as Gholami et al. (2021); Liang et al. (2021) 5As opposed to running a processing a large batch of texts over many ...</td>\n",
              "      <td>\\nThe paper discusses that the generative inference process in large language models consists of two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once per prompt.\\n\\n2) Generating tokens: This phase generates tokens sequentially, token-by-token and layer-by-layer. This means that each layer needs to be loaded once per generated token. As a result, the inference speed becomes limited by how fast parameters can be fetched from memory. \\n\\nThe key insight is that phase 2 (generating tokens) accounts for the majority of end-to-end latency during generative inference. This is because all the model weights need to be loaded for every single generated token, causing inference to be heavily bottlenecked by parameter I/O instead of computation.\\n\\nIn summary, generating more tokens leads to lower utilization of compute (0....</td>\n",
              "      <td>1.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?</td>\n",
              "      <td>[Abstract\\nWe introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine- tuned to follow instructions, Mixtral 8x7B \u00e2\u0080\u0093 Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B \u00e2\u0080\u0093 chat model on hum...</td>\n",
              "      <td>\\nThe key differences between the architectures of Mixtral 8x7B and Mistral 7B are:\\n\\n1. Each layer in Mixtral is composed of 8 feedforward blocks (experts), compared to 1 feedforward block per layer in Mistral. \\n\\n2. In Mixtral, a router network selects 2 out of the 8 experts to process each token at every layer. So while each token has access to 47B parameters in total, only 13B parameters are active during inference for each token.\\n\\n3. Mixtral was trained with a larger context size of 32k tokens compared to Mistral.\\n\\nIn summary, Mixtral introduces sparsity in the feedforward layers through its mixture of experts architecture. This allows it to scale up to much larger overall parameters while keeping the active parameters during inference manageable.\\n</td>\n",
              "      <td>1.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>When is offloading used on the A100 server for accelerating MoE-based language models?</td>\n",
              "      <td>[# Denis Mazur Moscow Institute of Physics and Technology Yandex Researchcore denismazur8@gmail.com\\n# Abstract\\nWith the widespread adoption of Large Language Models (LLMs), many deep learning practitioners are looking for strategies of running these models more efficiently. One such strategy is to use sparse Mixture-of-Experts (MoE) \u00e2\u0080\u0094 a type of model architectures where only a fraction of model layers are active for any given input. This property allows MoE-based language models to generate tokens faster than their \u00e2\u0080\u009cdense\u00e2\u0080\u009d counterparts, but it also increases model size due to having multiple \u00e2\u0080\u009cexperts\u00e2\u0080\u009d. Unfortunately, this makes state-of-the-art MoE language models difficult to run without high-end GPUs. In this work, we study the problem of running large MoE language models on consumer hardware with limited accelerator memory. We build upon parameter offloading algorithms and propose a novel strategy that accelerates offloading by taking advantage of innate properties o...</td>\n",
              "      <td>\\nThe paper discusses using offloading strategies to run large MoE (Mixture of Experts) language models on consumer hardware with limited GPU memory. Some key points:\\n\\n- MoE models have sparse activations, where only a fraction of experts are active on each input. This allows faster inference than dense models.\\n\\n- But MoE models are very large, making them difficult to run on consumer GPUs. For example, Falcon-180B needs ~360GB memory. \\n\\n- The paper proposes MoE-specific offloading that takes advantage of:\\n  - Expert reuse between tokens (using an LRU cache)\\n  - Early layer states predicting later active experts. This allows prefetching experts.\\n  \\n- They use these strategies to run the Mixtral-8x7B MoE model on T4, RTX 3060, 3080 GPUs at 2-3 tokens/sec.\\n\\nSo in summary, the offloading is designed to mitigate the limited memory on consumer GPUs in order to run very large MoE models for inference. It exploits properties of MoE sparsity to reduce communication and overlap ...</td>\n",
              "      <td>1.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>How does Mixtral compare to Llama 2 70B in code benchmarks?</td>\n",
              "      <td>[Table 2: Comparison of Mixtral with Llama. Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference.\\n70 Mixtral 8x7B. \u00e2\u0080\u0098Mixtral 8x7B Mixtral 8x7B 355 =o = Es &amp; E60! Mistral 78 % 2681 Mistral 78 3 3 s0 5 = A % 66 50 g 4 45 64 78 138 348708 78 138 348708 78 138 348 70B S66 Mixtral 8x7B 50 Mixtral 8x7B 5 = 564 340 g al Mistral 78 ee Mistral 78 3 5 \u00c2\u00a7 30 5 eo \u00e2\u0080\u0094= Mistral \u00c2\u00b0 20 \u00e2\u0080\u0094e LlaMA2 78 (138 348 70B 7B (138 348 708 7B \u00c2\u00ab13B 34B 708 Active Params Active Params Active Params\\nFigure 3: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension, math and code for Mistral (7B/8x7B) vs Llama 2 (7B/13B/70B). Mixtral largely outperforms Llama 2 70B on all benchmarks, except on reading comprehension benchmarks while using 5x lower active parameters. It is also vastly superior to Llama 2 70B on code and math., Table 2: Comparison of Mixtral with Llama. Mixtral outperforms or ma...</td>\n",
              "      <td>\\nBased on the information from the arXiv paper, Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference. Specifically:\\n\\n- Mixtral 8x7B outperforms Llama 2 70B on metrics like MMLU, commonsense reasoning, world knowledge, reading comprehension, math and code benchmarks. It has better performance despite having significantly smaller model capacity.\\n\\n- Mixtral uses 13B active parameters per token during inference compared to 70B for Llama 2. With 5x lower active parameters, Mixtral still outperforms Llama 2 70B on most categories.\\n\\nSo in summary, Mixtral compares very favorably to Llama 2 70B on code benchmarks, outperforming it on most metrics while being much more parameter efficient.\\n</td>\n",
              "      <td>1.0</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-3c100900-eb13-4885-b2a6-e110d4174f0a')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-3c100900-eb13-4885-b2a6-e110d4174f0a button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-3c100900-eb13-4885-b2a6-e110d4174f0a');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-aee90450-0c67-4ebc-9485-11273af80d0f\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-aee90450-0c67-4ebc-9485-11273af80d0f')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-aee90450-0c67-4ebc-9485-11273af80d0f button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "text/plain": [
              "                                                                                                                                       question  \\\n",
              "0                                                   What is the impact of encoding the input prompt on inference speed in generative inference?   \n",
              "1                                                                How does generating tokens affect the inference speed in generative inference?   \n",
              "2  How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?   \n",
              "3                                                        When is offloading used on the A100 server for accelerating MoE-based language models?   \n",
              "4                                                                                   How does Mixtral compare to Llama 2 70B in code benchmarks?   \n",
              "\n",
              "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  contexts  \\\n",
              "0  [The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.\\nBelow, we look for patterns in how the MoE model loads its experts and propose ways to exploit these patterns to speed up inference time.\\n4To learn more about these methods, please refer to surveys such as Gholami et al. (2021); Liang et al. (2021) 5As opposed to running a processing a large batch of texts over many ...   \n",
              "1  [The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.\\nBelow, we look for patterns in how the MoE model loads its experts and propose ways to exploit these patterns to speed up inference time.\\n4To learn more about these methods, please refer to surveys such as Gholami et al. (2021); Liang et al. (2021) 5As opposed to running a processing a large batch of texts over many ...   \n",
              "2  [Abstract\\nWe introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine- tuned to follow instructions, Mixtral 8x7B \u00e2\u0080\u0093 Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B \u00e2\u0080\u0093 chat model on hum...   \n",
              "3  [# Denis Mazur Moscow Institute of Physics and Technology Yandex Researchcore denismazur8@gmail.com\\n# Abstract\\nWith the widespread adoption of Large Language Models (LLMs), many deep learning practitioners are looking for strategies of running these models more efficiently. One such strategy is to use sparse Mixture-of-Experts (MoE) \u00e2\u0080\u0094 a type of model architectures where only a fraction of model layers are active for any given input. This property allows MoE-based language models to generate tokens faster than their \u00e2\u0080\u009cdense\u00e2\u0080\u009d counterparts, but it also increases model size due to having multiple \u00e2\u0080\u009cexperts\u00e2\u0080\u009d. Unfortunately, this makes state-of-the-art MoE language models difficult to run without high-end GPUs. In this work, we study the problem of running large MoE language models on consumer hardware with limited accelerator memory. We build upon parameter offloading algorithms and propose a novel strategy that accelerates offloading by taking advantage of innate properties o...   \n",
              "4  [Table 2: Comparison of Mixtral with Llama. Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference.\\n70 Mixtral 8x7B. \u00e2\u0080\u0098Mixtral 8x7B Mixtral 8x7B 355 =o = Es & E60! Mistral 78 % 2681 Mistral 78 3 3 s0 5 = A % 66 50 g 4 45 64 78 138 348708 78 138 348708 78 138 348 70B S66 Mixtral 8x7B 50 Mixtral 8x7B 5 = 564 340 g al Mistral 78 ee Mistral 78 3 5 \u00c2\u00a7 30 5 eo \u00e2\u0080\u0094= Mistral \u00c2\u00b0 20 \u00e2\u0080\u0094e LlaMA2 78 (138 348 70B 7B (138 348 708 7B \u00c2\u00ab13B 34B 708 Active Params Active Params Active Params\\nFigure 3: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension, math and code for Mistral (7B/8x7B) vs Llama 2 (7B/13B/70B). Mixtral largely outperforms Llama 2 70B on all benchmarks, except on reading comprehension benchmarks while using 5x lower active parameters. It is also vastly superior to Llama 2 70B on code and math., Table 2: Comparison of Mixtral with Llama. Mixtral outperforms or ma...   \n",
              "\n",
              "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    answer  \\\n",
              "0  \\nThe paper discusses that the generative inference workload has two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once for the entire prompt.\\n\\n2) Generating tokens conditioned on the prompt: This phase runs sequentially token-by-token and layer-by-layer. In this phase, each layer needs to be loaded once per generated token. So inference speed is limited by how fast parameters can be fetched from memory.\\n\\nThe key difference in terms of impact on inference speed is that the prompt encoding phase works well with existing parallelization techniques, while the sequential token generation phase is harder to optimize and can become a bottleneck.\\n\\nSo in summary, the prompt encoding phase has less of an impact on slowing down inference, while the sequential generation phase can significantly limit inference speed if ...   \n",
              "1  \\nThe paper discusses that the generative inference process in large language models consists of two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once per prompt.\\n\\n2) Generating tokens: This phase generates tokens sequentially, token-by-token and layer-by-layer. This means that each layer needs to be loaded once per generated token. As a result, the inference speed becomes limited by how fast parameters can be fetched from memory. \\n\\nThe key insight is that phase 2 (generating tokens) accounts for the majority of end-to-end latency during generative inference. This is because all the model weights need to be loaded for every single generated token, causing inference to be heavily bottlenecked by parameter I/O instead of computation.\\n\\nIn summary, generating more tokens leads to lower utilization of compute (0....   \n",
              "2                                                                                                                                                                                                                                       \\nThe key differences between the architectures of Mixtral 8x7B and Mistral 7B are:\\n\\n1. Each layer in Mixtral is composed of 8 feedforward blocks (experts), compared to 1 feedforward block per layer in Mistral. \\n\\n2. In Mixtral, a router network selects 2 out of the 8 experts to process each token at every layer. So while each token has access to 47B parameters in total, only 13B parameters are active during inference for each token.\\n\\n3. Mixtral was trained with a larger context size of 32k tokens compared to Mistral.\\n\\nIn summary, Mixtral introduces sparsity in the feedforward layers through its mixture of experts architecture. This allows it to scale up to much larger overall parameters while keeping the active parameters during inference manageable.\\n   \n",
              "3  \\nThe paper discusses using offloading strategies to run large MoE (Mixture of Experts) language models on consumer hardware with limited GPU memory. Some key points:\\n\\n- MoE models have sparse activations, where only a fraction of experts are active on each input. This allows faster inference than dense models.\\n\\n- But MoE models are very large, making them difficult to run on consumer GPUs. For example, Falcon-180B needs ~360GB memory. \\n\\n- The paper proposes MoE-specific offloading that takes advantage of:\\n  - Expert reuse between tokens (using an LRU cache)\\n  - Early layer states predicting later active experts. This allows prefetching experts.\\n  \\n- They use these strategies to run the Mixtral-8x7B MoE model on T4, RTX 3060, 3080 GPUs at 2-3 tokens/sec.\\n\\nSo in summary, the offloading is designed to mitigate the limited memory on consumer GPUs in order to run very large MoE models for inference. It exploits properties of MoE sparsity to reduce communication and overlap ...   \n",
              "4                                                                                                                                                                                                                    \\nBased on the information from the arXiv paper, Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference. Specifically:\\n\\n- Mixtral 8x7B outperforms Llama 2 70B on metrics like MMLU, commonsense reasoning, world knowledge, reading comprehension, math and code benchmarks. It has better performance despite having significantly smaller model capacity.\\n\\n- Mixtral uses 13B active parameters per token during inference compared to 70B for Llama 2. With 5x lower active parameters, Mixtral still outperforms Llama 2 70B on most categories.\\n\\nSo in summary, Mixtral compares very favorably to Llama 2 70B on code benchmarks, outperforming it on most metrics while being much more parameter efficient.\\n   \n",
              "\n",
              "   faithfulness  \n",
              "0           1.0  \n",
              "1           1.0  \n",
              "2           1.0  \n",
              "3           1.0  \n",
              "4           1.0  "
            ]
          },
          "execution_count": 35,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "pd.set_option(\"display.max_colwidth\", 1000)\n",
        "result[[\"question\", \"contexts\", \"answer\", \"faithfulness\"]]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xQrPEENFg4j7"
      },
      "source": [
        "When calculating faithfullness RAGAS is using OpenAI LLMs to decide which claims are in the answer and whether they also exist in the context. Because of the \"generative\" nature of this approach we won't always get accurate scores.\n",
        "\n",
        "We can see that we get perfect scores for all but our fourth result, which scores `0.0`. However, when looking at this we can see some claims that seem related. Nonetheless the fourth answer does seem to be less grounded in the truth of our context than other responses, indicated that there is justification behind this low score."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JxZEFhW0h1FE"
      },
      "source": [
        "### Answer Relevancy"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IIA7gdXqh2wF"
      },
      "source": [
        "Answer relevancy is our final metric. It focuses on the generation component and is similar to our \"context precision\" metric in that it measures how much of the returned information is relevant to our original question.\n",
        "\n",
        "We return a low answer relevancy score when:\n",
        "\n",
        "* Answers are incomplete.\n",
        "\n",
        "* Answers contain redundant information.\n",
        "\n",
        "A high answer relevancy score indicates that an answer is concise and does not contain \"fluff\" (ie irrelevant information).\n",
        "\n",
        "The score is calculated by asking an LLM to generate multiple questions for a generated answer and then calculating the cosine similarity between the original question and the generated questions. Naturally, if we have a concise answer that answers a very specific question, we should find that the generated question will have a high cosine similarity to the original question."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 37,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 796
        },
        "id": "G_w-A1WOzddF",
        "outputId": "32c765a8-73c3-4e31-b1c9-7a8ead2c2cdf"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "summary": "{\n  \"name\": \"result[[\\\"question\\\", \\\"answer\\\", \\\"answer_relevancy\\\"]]\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"question\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"How does generating tokens affect the inference speed in generative inference?\",\n          \"How does Mixtral compare to Llama 2 70B in code benchmarks?\",\n          \"How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"answer\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"\\nThe paper discusses that the generative inference process in large language models consists of two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once per prompt.\\n\\n2) Generating tokens: This phase generates tokens sequentially, token-by-token and layer-by-layer. This means that each layer needs to be loaded once per generated token. As a result, the inference speed becomes limited by how fast parameters can be fetched from memory. \\n\\nThe key insight is that phase 2 (generating tokens) accounts for the majority of end-to-end latency during generative inference. 
This is because all the model weights need to be loaded for every single generated token, causing inference to be heavily bottlenecked by parameter I/O instead of computation.\\n\\nIn summary, generating more tokens leads to lower utilization of compute (0.1% in the paper's example) and higher inference latency due to repeatedly loading parameters from memory. This highlights why the number of generated tokens is a key factor affecting inference speed.\\n\",\n          \"\\nBased on the information from the arXiv paper, Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference. Specifically:\\n\\n- Mixtral 8x7B outperforms Llama 2 70B on metrics like MMLU, commonsense reasoning, world knowledge, reading comprehension, math and code benchmarks. It has better performance despite having significantly smaller model capacity.\\n\\n- Mixtral uses 13B active parameters per token during inference compared to 70B for Llama 2. With 5x lower active parameters, Mixtral still outperforms Llama 2 70B on most categories.\\n\\nSo in summary, Mixtral compares very favorably to Llama 2 70B on code benchmarks, outperforming it on most metrics while being much more parameter efficient.\\n\",\n          \"\\nThe key differences between the architectures of Mixtral 8x7B and Mistral 7B are:\\n\\n1. Each layer in Mixtral is composed of 8 feedforward blocks (experts), compared to 1 feedforward block per layer in Mistral. \\n\\n2. In Mixtral, a router network selects 2 out of the 8 experts to process each token at every layer. So while each token has access to 47B parameters in total, only 13B parameters are active during inference for each token.\\n\\n3. Mixtral was trained with a larger context size of 32k tokens compared to Mistral.\\n\\nIn summary, Mixtral introduces sparsity in the feedforward layers through its mixture of experts architecture. 
This allows it to scale up to much larger overall parameters while keeping the active parameters during inference manageable.\\n\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"answer_relevancy\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.08183968864022804,\n        \"min\": 0.7600165621435971,\n        \"max\": 0.9549395806291051,\n        \"num_unique_values\": 5,\n        \"samples\": [\n          0.8299815248874881,\n          0.9549395806291051,\n          0.9284088978957122\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
              "type": "dataframe"
            },
            "text/html": [
              "\n",
              "  <div id=\"df-c02928f8-89fd-45fb-9f28-53ae42eca08c\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>question</th>\n",
              "      <th>answer</th>\n",
              "      <th>answer_relevancy</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>What is the impact of encoding the input prompt on inference speed in generative inference?</td>\n",
              "      <td>\\nThe paper discusses that the generative inference workload has two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once for the entire prompt.\\n\\n2) Generating tokens conditioned on the prompt: This phase runs sequentially token-by-token and layer-by-layer. In this phase, each layer needs to be loaded once per generated token. So inference speed is limited by how fast parameters can be fetched from memory.\\n\\nThe key difference in terms of impact on inference speed is that the prompt encoding phase works well with existi...</td>\n",
              "      <td>0.812683</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>How does generating tokens affect the inference speed in generative inference?</td>\n",
              "      <td>\\nThe paper discusses that the generative inference process in large language models consists of two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once per prompt.\\n\\n2) Generating tokens: This phase generates tokens sequentially, token-by-token and layer-by-layer. This means that each layer needs to be loaded once per generated token. As a result, the inference speed becomes limited by how fast parameters can be fetched from memory. \\n\\nThe key insight is that phase 2 (generating tokens) accounts for the majority of end...</td>\n",
              "      <td>0.829982</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?</td>\n",
              "      <td>\\nThe key differences between the architectures of Mixtral 8x7B and Mistral 7B are:\\n\\n1. Each layer in Mixtral is composed of 8 feedforward blocks (experts), compared to 1 feedforward block per layer in Mistral. \\n\\n2. In Mixtral, a router network selects 2 out of the 8 experts to process each token at every layer. So while each token has access to 47B parameters in total, only 13B parameters are active during inference for each token.\\n\\n3. Mixtral was trained with a larger context size of 32k tokens compared to Mistral.\\n\\nIn summary, Mixtral introduces sparsity in the feedforward layers through its mixture of experts architecture. This allows it to scale up to much larger overall par...</td>\n",
              "      <td>0.928409</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>When is offloading used on the A100 server for accelerating MoE-based language models?</td>\n",
              "      <td>\\nThe paper discusses using offloading strategies to run large MoE (Mixture of Experts) language models on consumer hardware with limited GPU memory. Some key points:\\n\\n- MoE models have sparse activations, where only a fraction of experts are active on each input. This allows faster inference than dense models.\\n\\n- But MoE models are very large, making them difficult to run on consumer GPUs. For example, Falcon-180B needs ~360GB memory. \\n\\n- The paper proposes MoE-specific offloading that takes advantage of:\\n  - Expert reuse between tokens (using an LRU cache)\\n  - Early layer states predicting later active experts. This allows prefetching experts.\\n  \\n- They use these strategies t...</td>\n",
              "      <td>0.760017</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>How does Mixtral compare to Llama 2 70B in code benchmarks?</td>\n",
              "      <td>\\nBased on the information from the arXiv paper, Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference. Specifically:\\n\\n- Mixtral 8x7B outperforms Llama 2 70B on metrics like MMLU, commonsense reasoning, world knowledge, reading comprehension, math and code benchmarks. It has better performance despite having significantly smaller model capacity.\\n\\n- Mixtral uses 13B active parameters per token during inference compared to 70B for Llama 2. With 5x lower active parameters, Mixtral still outperforms Llama 2 70B on most categories.\\n\\nSo in summary, Mixtral compares very favorably to Llama 2 70B on c...</td>\n",
              "      <td>0.954940</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-c02928f8-89fd-45fb-9f28-53ae42eca08c')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-c02928f8-89fd-45fb-9f28-53ae42eca08c button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-c02928f8-89fd-45fb-9f28-53ae42eca08c');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-8225da34-5327-46ee-9165-fd79b0ae1567\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-8225da34-5327-46ee-9165-fd79b0ae1567')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-8225da34-5327-46ee-9165-fd79b0ae1567 button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "text/plain": [
              "                                                                                                                                       question  \\\n",
              "0                                                   What is the impact of encoding the input prompt on inference speed in generative inference?   \n",
              "1                                                                How does generating tokens affect the inference speed in generative inference?   \n",
              "2  How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?   \n",
              "3                                                        When is offloading used on the A100 server for accelerating MoE-based language models?   \n",
              "4                                                                                   How does Mixtral compare to Llama 2 70B in code benchmarks?   \n",
              "\n",
              "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        answer  \\\n",
              "0  \\nThe paper discusses that the generative inference workload has two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once for the entire prompt.\\n\\n2) Generating tokens conditioned on the prompt: This phase runs sequentially token-by-token and layer-by-layer. In this phase, each layer needs to be loaded once per generated token. So inference speed is limited by how fast parameters can be fetched from memory.\\n\\nThe key difference in terms of impact on inference speed is that the prompt encoding phase works well with existi...   \n",
              "1  \\nThe paper discusses that the generative inference process in large language models consists of two main phases:\\n\\n1) Encoding the input prompt: This phase encodes the prompt tokens in parallel layer-by-layer. Existing Mixture-of-Experts algorithms work well for this phase since each layer only needs to be loaded once per prompt.\\n\\n2) Generating tokens: This phase generates tokens sequentially, token-by-token and layer-by-layer. This means that each layer needs to be loaded once per generated token. As a result, the inference speed becomes limited by how fast parameters can be fetched from memory. \\n\\nThe key insight is that phase 2 (generating tokens) accounts for the majority of end...   \n",
              "2  \\nThe key differences between the architectures of Mixtral 8x7B and Mistral 7B are:\\n\\n1. Each layer in Mixtral is composed of 8 feedforward blocks (experts), compared to 1 feedforward block per layer in Mistral. \\n\\n2. In Mixtral, a router network selects 2 out of the 8 experts to process each token at every layer. So while each token has access to 47B parameters in total, only 13B parameters are active during inference for each token.\\n\\n3. Mixtral was trained with a larger context size of 32k tokens compared to Mistral.\\n\\nIn summary, Mixtral introduces sparsity in the feedforward layers through its mixture of experts architecture. This allows it to scale up to much larger overall par...   \n",
              "3  \\nThe paper discusses using offloading strategies to run large MoE (Mixture of Experts) language models on consumer hardware with limited GPU memory. Some key points:\\n\\n- MoE models have sparse activations, where only a fraction of experts are active on each input. This allows faster inference than dense models.\\n\\n- But MoE models are very large, making them difficult to run on consumer GPUs. For example, Falcon-180B needs ~360GB memory. \\n\\n- The paper proposes MoE-specific offloading that takes advantage of:\\n  - Expert reuse between tokens (using an LRU cache)\\n  - Early layer states predicting later active experts. This allows prefetching experts.\\n  \\n- They use these strategies t...   \n",
              "4  \\nBased on the information from the arXiv paper, Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference. Specifically:\\n\\n- Mixtral 8x7B outperforms Llama 2 70B on metrics like MMLU, commonsense reasoning, world knowledge, reading comprehension, math and code benchmarks. It has better performance despite having significantly smaller model capacity.\\n\\n- Mixtral uses 13B active parameters per token during inference compared to 70B for Llama 2. With 5x lower active parameters, Mixtral still outperforms Llama 2 70B on most categories.\\n\\nSo in summary, Mixtral compares very favorably to Llama 2 70B on c...   \n",
              "\n",
              "   answer_relevancy  \n",
              "0          0.812683  \n",
              "1          0.829982  \n",
              "2          0.928409  \n",
              "3          0.760017  \n",
              "4          0.954940  "
            ]
          },
          "execution_count": 37,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "pd.set_option(\"display.max_colwidth\", 700)\n",
        "result[[\"question\", \"answer\", \"answer_relevancy\"]]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NCXWYnBlr69a"
      },
      "source": [
        "Again we can see poorer performance from our fourth answer but the remainder (particularly answer with similarity greater than `0.9`) perform well."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CVaNDvKwl7BE"
      },
      "source": [
        "---"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "ml",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.12"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}